Textual approach can work #7
Thanks for exploring this! These are some promising results indeed. To answer your question, yes, size reduction is the main goal here. I want to ship potentially even a lot more sets of headers than these along with the Zig compiler, and want to keep the installation size down to a reasonable amount. At the same time, I don't want to ship files that are compressed with a generic file compression strategy because I want to keep these properties for the Zig installation:
Furthermore, I think that providing a set of "universal headers" as a standalone deliverable could be something interesting to offer in and of itself. Other projects might like to benefit from it, and perhaps contribute to this project in return. This patch is making me think, maybe we could go with this textual approach you have explored, but with an additional preprocessing step to try to normalize them even more? For example, you're already normalizing the whitespace, but what if we tried to normalize the order of top-level declarations, too? To flesh out the idea a little more, start with some more straightforward normalizations:
Finally, the complicated part. Mark every declaration as having a set of input names and output names. For example, Our goal is to perform a sort of the declarations for each file, such that the sorted result has minimal textual differences across header files. For a given header file, for each named declaration, we will create two sort keys: primary, and then the name itself. For the primary, create a bitset with a bit for every header file of that name (e.g. a bit for every stdio.h file in the headers), populate that bit based on whether or not the name exists. This bitset, interpreted as an unsigned integer, is now used as the primary sort key.
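A minimal sketch of the two sort keys described above, written in Python purely for illustration. The function name `sort_keys` and the sample declaration names are hypothetical, and the input/output-name dependency handling is not modeled here; only the bitset-then-name ordering is shown.

```python
from typing import Dict, List, Set, Tuple


def sort_keys(names_per_file: List[Set[str]]) -> Dict[str, Tuple[int, str]]:
    """Map each declaration name to (presence bitset, name).

    names_per_file[i] is the set of names declared by the i-th header
    file of the same name (e.g. the i-th stdio.h). Bit i of the primary
    key is set when that file declares the name.
    """
    keys: Dict[str, Tuple[int, str]] = {}
    for name in set().union(*names_per_file):
        bits = 0
        for i, names in enumerate(names_per_file):
            if name in names:
                bits |= 1 << i
        keys[name] = (bits, name)
    return keys


# Hypothetical input: three versions of one header and the names they declare.
versions = [
    {"printf", "puts", "gets"},     # bit 0
    {"printf", "puts"},             # bit 1
    {"printf", "puts", "dprintf"},  # bit 2
]
keys = sort_keys(versions)
# Primary key: the bitset as an unsigned integer. Secondary key: the name.
ordered = sorted(keys, key=keys.get)
print(ordered)  # ['gets', 'dprintf', 'printf', 'puts']
```

Names that appear in exactly the same set of files share a primary key and end up adjacent, which is what should let a later textual diff wrap them in a single block.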
The idea here is that after this aggressive normalization is performed, the textual approach could work even better, especially when combining less similar headers. What do you think about this idea?
Thanks for the explanation! I think I understand what you are going for and agree with those installation properties. The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header. I think there's also some utility to having the universal header be somewhat readable by us when debugging things in the future, but that worry is more vague. My guess is that a few of those normalizations (like removing comments and empty lines) will be both easy to verify and give us most of the savings. Let me implement the easy ones, think more about the harder ones, and come back with more info.
On that note, @marler8997 has been working on a verification tool.
251d33d to 1d5a1fb
Progress update. Added stripping comments and newlines. Can now compile all headers together:
We are down from 1.8G to 150M. And of that, about 1/3 is just the overhead of the huge extra #if lines. They look like this:
I'm thinking of doing an extra processing step at the end to collapse some of those. So it could detect that the #if has all of the The code could generate an extra header file that looks like this:
And then add that header to the top of every file, I guess. Would that make sense as a way to organize these? I'll go down that route unless you have a better plan. Just eyeballing the headers, I don't think doing more sophisticated minimization will buy much savings, and it would all require a much deeper understanding of C/C++ syntax. Let's see if this is good enough.
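One way the collapsing step could look, as a rough Python sketch rather than the PR's actual code. Everything named here is an assumption made for illustration: the `defined(...)` form of the conditions, the version macros `VER_A`, `VER_B`, `VER_C`, the group macro `GROUP_AB`, and the helper functions themselves.

```python
from typing import Dict, FrozenSet, List, Set


def collapse(versions: Set[str], groups: Dict[str, FrozenSet[str]]) -> List[str]:
    """Replace every group that is fully present with its group macro."""
    remaining = set(versions)
    terms: List[str] = []
    # Try the largest groups first so the biggest savings are taken greedily.
    for macro, members in sorted(groups.items(), key=lambda g: (-len(g[1]), g[0])):
        if members <= remaining:
            terms.append(macro)
            remaining -= members
    return terms + sorted(remaining)


def condition(terms: List[str]) -> str:
    """Render an #if line from a list of macro names (assumed format)."""
    return "#if " + " || ".join(f"defined({t})" for t in terms)


def helper_header(groups: Dict[str, FrozenSet[str]]) -> str:
    """Generate the extra header that defines each group macro."""
    lines: List[str] = []
    for macro, members in sorted(groups.items()):
        cond = " || ".join(f"defined({m})" for m in sorted(members))
        lines += [f"#if {cond}", f"#define {macro}", "#endif"]
    return "\n".join(lines)


# Hypothetical group: VER_A and VER_B always appear together.
groups = {"GROUP_AB": frozenset({"VER_A", "VER_B"})}
short = condition(collapse({"VER_A", "VER_B", "VER_C"}, groups))
print(short)  # #if defined(GROUP_AB) || defined(VER_C)
print(helper_header(groups))
```

The greedy pass consumes a version once a group has covered it, so overlapping groups are not both applied; a real implementation may want a different policy.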
Here's a fun point of comparison: Zig's current installation size for libc headers is 91M. So you have already increased the number of libc header sets from 56 to 451, with only 1.6x the size. I think this is within shippable range; however, I still think we can do even better. As for reducing large sets of Maybe we can have a look at some common, large sets of if conditions and see if there are any similar simplifications that can be made?
This is awesome. The clause reduction problem is textbook SAT, but you can probably get 90% of the way to the best theoretical solution just by applying some good old-fashioned human knowledge. I do think that reordering definitions (what Andrew suggested a few comments ago) might also prove to be a worthwhile reduction (as it acts like a full removal of a #if clause), but I personally am not fully familiar with how much more complicated it would make things.
Down to 103M. Take a look at src/textdiff/reductions.zig - each entry lists all the headers that go together and the last line is what we replace them with. Some headers are in multiple reductions. With this set, there aren't that many remaining huge #if statements (judging from The replacements themselves are completely UNTESTED, and I made up some of them. Just want to see if this is the right path before digging into exactly what they should be. There will be a large cleanup pass if this works. What do you think?
Wow, incredible! That's only a 13% increase from status quo! I'll make some time this week to tinker with this.
Thanks! I've added src/textdiff/HOWTO.txt to help get you going.
This is mindblowing/amazing, depending on how one sees things. I am watching this closely now. Good job, @david-vanderson! Somewhat unrelated question before this gets rolled out: how far back do we want to go with regards to glibc? RHEL 7 ships with 2.17, the oldest "supported" distro I am aware of; its EOL will be on 2024-06-30. The next one is RHEL 8, which ships glibc 2.28.
❤️ from the past: ziglang/zig#15573
This implements a textual approach to making universal headers. We diff the different versions and add #if/#endif blocks. You end up with stuff like:
@andrewrk talked about universal headers being needed to solve some problem, but I don't understand what you are looking for besides size reduction. Does this strategy help you?
For size reduction, combining these headers together saves about 90%
Combining less similar headers gives less savings (2/3 for these)
Look at `src/textdiff/example.bash` for how I'm using/testing it.
One of the nice things about this strategy is testability - we can pull a specific version from the universal headers and we should get back the same files as that original version (modulo whitespace).
The rest of this description is some details about how it works and what problems I ran into.
Strategy is to maintain 2 work-in-progress files for each header. One has all the lines from all the versions. The other keeps the version list for each line in the first file. Here's how it looks in vimdiff
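To make the two-file layout concrete (this is not the vimdiff view, and not the PR's Zig code), here is a small Python sketch. It keeps two parallel lists, one of lines and one of version sets, folds versions in with the standard `difflib` module as a stand-in for whatever diff the tool really uses, and then emits #if/#endif blocks wherever the version set changes. The names `fold_in`, `output_header`, `VER_A` and `VER_B`, and the `defined(...)` condition format, are all assumptions.

```python
import difflib
from typing import List, Set, Tuple


def fold_in(lines: List[str], tags: List[Set[str]],
            new_lines: List[str], version: str) -> Tuple[List[str], List[Set[str]]]:
    """Merge one more version into the parallel (lines, tags) lists."""
    out_lines: List[str] = []
    out_tags: List[Set[str]] = []
    sm = difflib.SequenceMatcher(a=lines, b=new_lines, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        # Lines already in the work-in-progress are always kept.
        for i in range(i1, i2):
            out_lines.append(lines[i])
            out_tags.append(tags[i] | {version} if op == "equal" else set(tags[i]))
        # Lines only the new version has are added with just that version.
        if op in ("insert", "replace"):
            for j in range(j1, j2):
                out_lines.append(new_lines[j])
                out_tags.append({version})
    return out_lines, out_tags


def output_header(lines: List[str], tags: List[Set[str]],
                  all_versions: Set[str]) -> List[str]:
    """Wrap runs of lines that not every version shares in #if/#endif."""
    out: List[str] = []
    current = None  # version set of the open block, None when no block is open
    for line, vs in zip(lines, tags):
        wanted = None if vs == all_versions else frozenset(vs)
        if wanted != current:
            if current is not None:
                out.append("#endif")
            if wanted is not None:
                out.append("#if " + " || ".join(f"defined({v})" for v in sorted(wanted)))
            current = wanted
        out.append(line)
    if current is not None:
        out.append("#endif")
    return out


lines, tags = fold_in([], [], ["int a;", "int b;"], "VER_A")
lines, tags = fold_in(lines, tags, ["int a;", "int c;"], "VER_B")
header = output_header(lines, tags, {"VER_A", "VER_B"})
print("\n".join(header))
```

With these two toy versions, `int a;` stays unconditional while `int b;` and `int c;` each get their own block.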
To add a new version to the work-in-progress we
Once all the versions have been folded in, `outputHeaders` goes through the work-in-progress files and adds #if/#endif blocks where the versions change.
Why the context hash stuff?
Since we are adding our own #if/#endif blocks, we need to ensure ours don't interleave with existing ones. It's okay if everything is nested, but if version A had `#endif` and version B had `#endif /* new comment */` then we can't combine in the naive way:
The solution here is to preprocess the file and prepend each line with a hash to keep #if/#endif blocks from separate versions from becoming entangled. This can also happen with comments, because you can't put functioning #if/#endif in the middle of /* */ comments.
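The exact hashing scheme is not spelled out here, so the following Python sketch shows only one possible scheme, under stated assumptions: each line is prefixed with a short hash of the directive text of every conditional block that encloses it. Two versions then share a line only if the whole chain of enclosing #if ... #endif directives is textually identical. The function name `add_context_hashes` is hypothetical, and comment tracking is left out.

```python
import hashlib
import re
from typing import Dict, List, Tuple

DIRECTIVE = re.compile(r"^\s*#\s*(if|ifdef|ifndef|elif|else|endif)\b")


def add_context_hashes(lines: List[str]) -> List[str]:
    # Pass 1: record which blocks enclose each line, and collect the
    # directive lines (#if, #else, #endif, ...) that make up each block.
    blocks_of_line: List[Tuple[int, ...]] = []
    directives: Dict[int, List[str]] = {}
    stack: List[int] = []
    next_id = 0
    for line in lines:
        m = DIRECTIVE.match(line)
        kind = m.group(1) if m else None
        if kind in ("if", "ifdef", "ifndef"):
            stack.append(next_id)
            directives[next_id] = []
            next_id += 1
        if kind is not None and stack:
            directives[stack[-1]].append(line.strip())
        blocks_of_line.append(tuple(stack))
        if kind == "endif" and stack:
            stack.pop()
    # Pass 2: prefix each line with a hash of its enclosing blocks' directives.
    out: List[str] = []
    for line, blocks in zip(lines, blocks_of_line):
        ctx = "\n".join("\n".join(directives[b]) for b in blocks)
        digest = hashlib.sha1(ctx.encode()).hexdigest()[:8]
        out.append(f"{digest} {line}")
    return out


a = add_context_hashes(["int z;", "#if X", "int a;", "#endif"])
b = add_context_hashes(["int z;", "#if X", "int a;", "#endif /* new comment */"])
# The top-level line still matches across versions; the block's lines do not,
# so a line diff keeps each version's block whole instead of interleaving them.
print(a[0] == b[0], a[1] == b[1], a[2] == b[2])  # True False False
```

The prefixes would be stripped again before writing the final headers.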
Some random things I did not expect:
- `#ifndef` inside a comment, because the comment is giving an example (so the code needs to track if it is inside a comment or not)
- `#` and `if`
Testing
Right now each #if that we add gets an extra _ZIG_UH_TEST added to the end. This allows `testHeaders` to only evaluate those #if/#endif blocks and keep all the rest.
I've run every header dir successfully in at least one batch.
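A hedged Python sketch of what evaluating only the marked blocks might look like. The assumed line format (`#if defined(VER_A) ... _ZIG_UH_TEST`), the version macro names, and the function `extract_version` are hypothetical; only the `_ZIG_UH_TEST` marker comes from the description above. Directives that appear inside comments are not handled.

```python
import re
from typing import List, Optional

MARK = "_ZIG_UH_TEST"
OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
CLOSE = re.compile(r"^\s*#\s*endif\b")


def extract_version(lines: List[str], version: str) -> List[str]:
    """Evaluate only the marked #if blocks; pass everything else through."""
    out: List[str] = []
    # One entry per open #if: None for an original block, or the truth
    # value of a marked (added) block for the requested version.
    stack: List[Optional[bool]] = []
    for line in lines:
        if OPEN.match(line):
            if line.rstrip().endswith(MARK):
                names = re.findall(r"defined\((\w+)\)", line)
                stack.append(version in names)
                continue  # drop the added #if line itself
            stack.append(None)
        elif CLOSE.match(line) and stack:
            if stack.pop() is not None:
                continue  # drop the #endif that closes an added block
        # Keep the line unless some enclosing added block is false.
        if all(keep is not False for keep in stack):
            out.append(line)
    return out


universal = [
    "int a;",
    "#if defined(VER_A) _ZIG_UH_TEST",
    "int b;",
    "#endif",
    "#if defined(VER_B) _ZIG_UH_TEST",
    "int c;",
    "#endif",
    "#ifdef __cplusplus",
    "extern int d;",
    "#endif",
]
print("\n".join(extract_version(universal, "VER_A")))
```

The original `#ifdef __cplusplus` block survives untouched, so the result can be compared against the original header for that version.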
Stuff that could be improved