Try to force inlining of newSV_type (i -> I in embed.fnc) #23190
When `Perl_newSV_type` became an inline function, the idea was that using it to create a specific type known at compile time should result in the call being completely inlined into the call site. So something like this would always be inlined by gcc/clang under default build settings: `SV* mySV = newSV_type(SVt_PV)`

At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.

This commit changes the inline flag in embed.fnc from `i` ("please try to inline") to `I` ("always inline", where supported). This restores the intended behaviour.

The more aggressive inlining flag wasn't originally specified out of caution, in case it caused the perl binary size to grow excessively. On the present codebase though, building with this flag and gcc 12 actually resulted in the binary size shrinking by 312 bytes.
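To make the difference between an inline hint and a forced inline concrete, here is a minimal sketch. It does not use Perl's real headers: `demo_type`, `demo_sv`, `DEMO_FORCEINLINE`, `demo_new_type` and the body sizes are all hypothetical stand-ins. It only assumes the GCC/Clang `always_inline` attribute and MSVC's `__forceinline` keyword, which is roughly the kind of decoration an "always inline" flag asks for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for SV types; NOT Perl's real svtype enum. */
typedef enum { DEMO_T_IV, DEMO_T_PV, DEMO_T_AV, DEMO_T_COUNT } demo_type;

typedef struct {
    demo_type type;
    size_t    body_size;   /* made-up sizes, for illustration only */
} demo_sv;

/* "Always inline where supported", falling back to a plain hint. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_FORCEINLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define DEMO_FORCEINLINE static __forceinline
#else
#  define DEMO_FORCEINLINE static inline
#endif

/* One allocator covering every type. When `t` is a compile-time
 * constant and the call is inlined, the compiler can discard every
 * switch arm except the one that matches. */
DEMO_FORCEINLINE demo_sv demo_new_type(const demo_type t)
{
    demo_sv sv;
    sv.type = t;
    switch (t) {
    case DEMO_T_IV: sv.body_size = 0;  break;
    case DEMO_T_PV: sv.body_size = 24; break;
    case DEMO_T_AV: sv.body_size = 40; break;
    default:
        assert(0 && "unknown demo type");
        sv.body_size = 0;
        break;
    }
    return sv;
}

/* Call site with a constant type: ideally this compiles down to the
 * DEMO_T_PV arm only, with no call and no switch left over. */
demo_sv demo_new_pv(void)
{
    return demo_new_type(DEMO_T_PV);
}
```

Whether the switch really disappears still depends on the compiler and optimization level, so disassembling a caller such as `demo_new_pv` is the way to confirm it.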
Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?
Not sure off the top of my head. I wondered if it could have been 24c3369 but won't get time to do a before and after build before the weekend.
Okay, let me re-phrase. How would we have previously known that inlining was happening -- and subsequently no longer was happening?
between
and blead's
perl541.dll b4 3.06 MB (3,210,240 bytes) 3,135kb MSVC 2022 x64 -O1, this commit in this PR exposed/created another bug that was hiding. Before this commit the byte sequence/C string After this commit I now have 15 callsites in perl541.dll that look like this
caused by this commit, which should be reverted, or, same thing, restore for non -DDEBUGGING perl builds the old non-static, non-inline macro code that disappeared in this commit. This commit changed the MSVC inline cost analyzer/tree walker to decide to not!!! inline the non-exported sym
The reason why the wins and losses are so random file to file depends on how many before this commit my libperl had 23 copies of "all 17 types afterwards, 0 copies of Perl_forbid_outofblock_ops Perl_vivify_ref rough math (0x316-0x187)/2 = 200 bytes for each "1 of 17"
LTO is one of the least dependable, least portable, and one of the most "undefined" parts of any brand of C compiler. LTO the feature doesn't exist outside of FAANG-sponsored C compilers (Clang/Apple, MSVC, GCC [partially]). So anyone with a commercial Unix OS, or Tiny C, tough luck, 25-50 copies of "all 17 types".

A compiler's/linker's LTO byte stream output is stable for exactly 1 build number of that C compiler. Security people only care about "binary reproducible" with today's C compiler/tool chain, not yesterday's, not tomorrow's nightly build of a C compiler, and that security concern is more about the final binary, or a large group of the final binaries, not being a back channel data exfiltration path between 1 person's/dev's personal box/build farm VMs and the general public. https://www.eff.org/fr/pages/list-printers-which-do-or-do-not-display-tracking-dots

LTO isn't supposed to even exist on ELF systems in a .so file, b/c of the ELF sym interposition rule. Only the root process start-up bin on ELF can be LTOed, according to tech specifications. LTO is always default off in all C compilers, any brand. The bytecode .o/.a/.lib created during an LTO compile can't be archived or distributed publicly (FOSS or clients). LTO byte code inside a .o has no forwards/backwards compat except with the same build number of the C compiler that made it.

Perl only uses LTO b/c someone at P5P wrote and pushed to blead a patch for their favorite brand of C compiler. And that patch uses a 100% proprietary, non-portable, vendor-specific flag to turn LTO on. If LTO works on your CC/OS/build env, awesome. Good for you and everyone else with the same platform permutation. But depending on LTO to fix BAD CODE and coding mistakes? Nope. I disagree with that. An optimization of a CC can change or come and go or break and get fixed 3 years later. Never write code assuming LTO/tail call optimization WILL happen or else. Don't fix infinite stack recursion with -O2/-O3.
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html Exactly which opcodes/bytes a same-brand C compiler will spit out 9-18 months from now is totally random. Each new Intel/AMD CPU chip die/microarchitecture (every 18 months) has micro-benchmarkable differences for all its basic/general purpose opcodes. Some opcodes get slower, some get faster; it's random and not connected to MHz or the machine code getting benchmarked. GCC and Clang are facing social drama about de-optimizing and about making Pentium 3 and Pentium 4 CPUs intentionally slow out of the box to improve speeds for >= Core 2, or de-optimizing for Core 2 or older Pentium 4 but not "removing" or "breaking" compat with those very old CPUs. It's not a SIGILL. It's just de-optimizing. A build flag exists to get back Pentium 3/P4 mode if an end user really wants it back.

There are other tickets I've opened to fix this bug and clean up this API to be correctly const foldable and not depending on sketchy permutations of LTO + OS kernel specifics + C compiler authors' random decisions going right. I have mixed feelings on this commit: it made libperl much bigger, but affects smallish typical CPAN .dlls to be randomly bigger or smaller than before. A not-that-hard test could be written to check libperl/all separate XS .dlls/.sos for
I did a build of recent blead with clang-21 with
So it does appear to be inlined most of the time. I'm more worried about whether the compiler optimizes based on the type of the resulting SV, ie. does it optimize away a type check for The other problem we've had with forced inline is build errors from the |
And -O2 on recent/modern late 2010s/2020s Mingw for Win32/Win64 GCCs was causing/is currently causing an unfixable SEGV nobody can diagnose, and stable perl was/has/is currently using -O1 on WinOS GCCs until further notice. I probably could fix whatever the assembly code/code gen bug is, or find the active open GCC ticket with the exact bug in GCC, or figure out what magic

The permanent fix for 17 individual

I don't want to touch the ticket or write a blead/stable (same thing nowadays) patch for #22667 until a bunch of other small "break-out" "support" PRs get merged or approved into blead like

The "final" patch for #22667 WILL cause a small amount of BBC breakage, since the 1 and only fix involves doing this
and adding a new non-static, non-inline, exported from libperl IIRC there are 2 statements currently in blead perl's repo, which do The BBC breakage/slight change to the public API docs of
CPAN XS modules that are doing
Thanks for the feedback, all. It's good to know that inlining is still happening with clang-21. (I really think the performance improvement is worth it.) Maybe my distro's compilers are just comparatively long in the tooth. I won't pursue this PR further over the gcc
I'll try to do some experiments to see, though presumably that could vary between compilers and versions. I'd like to imagine that:
But with optimising compilers, who knows! c79fe2b (which IIRC had follow-ups) was in response to reasonable feedback received about the naming of I'm not clear that #22667 is the right or best approach. For example, what if I did the following, with clear comments?
It seems like that could address the main bloat concerns raised by bulk88.
You can get similar info from gcc with

In my casual testing I didn't find any misses, but I just did a full run and found a few missed cases:
Rough stats:
which is 78%-ish inlined.
The original concept/whiteboard optimization of splitting the mega "all 17 types" SV allocator into 17 tiny separate allocators is an excellent idea IMO. The optimization just needs to be implemented, src code wise, a different way than it's implemented currently in stable/blead perl. The implementation of this optimization needs to be done at

The current blead/stable perl implementation makes assumptions that all C compilers will perform https://en.wikipedia.org/wiki/Loop-invariant_code_motion between 2 different linker symbols (C function call bodies), or in other words, do loop invariant code motion through the OS/CPU's ABI standard, and through the official C language declaration of a C function's prototype in a .o file, and disobey what the developer typed in source code as the static function call's prototype.

MSVC refuses to "slide to the left" arguments on the right side of an unused argument (in the body of a static C function). Instead MSVC just leaves a hole, or leaves the CPU register with uninitialized/unknown content in the parent callers of the C static (it won't emit the asm opcode to move the int constant that was written in C src code to the register), and the static function never reads the unused volatile CPU register representing unused C lang incoming arg #1 or arg #3, before reusing that CPU register for other things. IDK what GCC or WinClang GCC or WinClang MSVC will do in the same situation. Will they slide to the left all remaining right side variables when they LTO optimize out an unused incoming C stack (CPU reg) arg of a C static function, between it and its callers? Some random reasons I've guessed in my head why C prototype args can't be "slid to the left" by a C compiler, even with LTO.
Clang/Intel C def, maybe GCC , running with 80x25 chars of proprietary command line flags, will do some amazing LTO, just like i386 MSVC does, but thats a huge maint investment to turn on from a P5P side, and force those proprietary command line flags from clang/gcc/intel c, onto
I would vote AGAINST any "optimization" patch/PR that suddenly lands in the P5P bug queue, where the optimization, "messes up" address space so much, inside a perl.exe process, that the perl interp, executes ALL CPAN XS authors's C99 XSUBs, in a 2nd
click to expand, this is about C lang/C linker arch, and isn't about Perl's C codeAlso a possible reason MS forbid custom ABIs on Win64, is that on Win32 for i386, C backtraces are 100% impossible to get for a frozen process or a crash mem dump file, unless you have .pdb files, for all (every last one) .dlls in virtual address space (tough luck if have a commercial closed source non-MS created .dll in your perl.exe address space. On Win for i386 C++ exceptions/C89 exceptions (SEH), each frame must "register" a real extern C89ish C function with arbitrary machine code, with the Windows kernel, on how to destroy/unwind/reverse its C stack call frame. MS's PDB file format has static C struct analogs of those i386 stack destruction C89 functions for IDE C debugger purposes.By MS removing that dangerous/unstable design, and replacing it with the SysV solution for Win64 on x64/ARM64, is a massive improvement for mix-n-matching random .dll'es from random authors made with random versions of MSVC and WinGCC and Borland. Static unwind tables look like CSS language or make language, not a Turing complete language like Perl/C/Java/Python. Any usermode or windows kernel mode library can effortless O(1)/O(n) inspect the C stack all the way back to "OS thread start call frame", just like a C debugger can, if it really needs to for some very unusual reason. GCC DWARF files and .pdbs are "wishlist" on Win64, not "punchlist" for a dev using a C debugger, after MS copy pasted SysV' spec's design for Win on AMD64. Ancient WinNT for SPARC/MIPS/Alpha/PowerPC, did not have any "Microsoft" ABI. 
The ancient MS Windows SDKs for those 4 CPUs, just included a link/postal address telling developers to read the man pages and do whatever it says in the native commercial Unix OS's man pages covering sys calls and do whatever that commercial Unix OS's man pages for No production code/production apps/kernel drivers ever actually do this on Windows OS, because its pointless, unless they intentionally want to be C debuggers or produce their own automated "crash" reports without "help" from the Windows Kernel/MS official public APIs for 3rd party C debuggers. One exception to the last sentence, 99% of commercial video game engines will walk/backtrace the C stack at runtime constantly looking for gaming cheats/cracks in address space, and video game engines constantly enumerate or monitor their user mode address space for "hostile" .dlls or "hostile" executable memory pages, inserted remotely into the process by using MS public API C debugger function calls by the person sitting at the keyboard/mouse. Random topic, the last paragraph reminded me of, Chrome/Chromium/OSX probably DIY this algorithm and I know for sure Win10 kernel does this since MS says it does. If a Win 10 OS machine is on wall/mains power, the kernel when it does a "mark and sweep" of your address space, to look for "cold" 4KB pages you aren't dereferencing for a long time, nowadays Win10 will constantly GZIP groups of your "cold" 4KB pages from physical ram, into other free physical ram, and then wait for a while longer, before sending those GZIPed 4KB malloc memory to your SSD drive. GBs of phy RAM is very cheap nowadays. Burning out SSD drives is not cheap. I throw out my Transcends/Sandisk SSDs every 2 years on my dev/C compiler machines nowadays. Usually my SSDs start to crawl at 5-10MBPS on SATA at the end of their life, or 1 time ever (learned my lesson), the SSD outright dies loosing all data. 
So yeah, GZIP 4KB memory pages from phy ram to phy ram, before sending them to SSD is very common nowadays.. Because Linux/Windows GZIP "cold" user mode malloc pages instead of using the paging file nowadays. Which P5P Op Tree Guru wants to implement GZIPing cold PP sub's optree structs at runtime, to free up phy ram and virtual paging file private (malloc) bytes at /bin/perl process runtime? I hope everyone knows Perl's optree structs can be stored in HW const read only .dll/.so memory and be executed by the runloop and an unmodified threaded perl engine at runtime. I've tried it, and got a PP sub stored in a .dll, that just does But!!!! BUT!!!!! anything more complicated than My C compiler/all C compilers will generate tiny 1 CPU instruction long, C functions that jump from the "current .dll" to libperl.dll, I can't make the C compiler at compile time, scribble in the same C function pointer memory address that libperl.dll uses to scribble in the addresses Perl_pp_foo_whatever() into the optree. All CPAN XS libraries that want to create their own OP structs, will use P5P's official
Yeah, assuming LTO can legally/safely change, the declared by the C developer, with his keyboard, C/C++ official ASCII string name of a C static symbol, is a very big request from multiple different engineering teams. Doing optimizations to C autos backed by the C stack, that never had
Correct, when I say revert, it's not going to be done using git's revert GUI button, but by writing new code to bring back the old design pattern. Where is the chat log/ticket/ML thread about that commit, if there is one? I'd like to see the design/engineering chit chat from the time of that commit if it's available.
Pointless as an optimization, since const declared global structs between .sos/.dlls can't be const folded by the CC. Some CCs/ABIs/OSes may also have a policy (aside on C lang/C linker arch, not about Perl's C code): "Const for your .so, not for me (my .so), haha, it's MY API you are consuming bro". In reality, ISO C, or ELF/PE/.dll/.so global data symbols can be intentionally misdeclared with

But I would like to see that struct made an exported symbol from libperl, so CPAN XS authors can inspect the official, authoritative copy of that C struct at runtime, without using a
Just generic design talk by me, I want to be dead sure my final patch, whatever it looks like in the end, truly matches and does the same thing as described in the untouched original bodies_by_type struct. There will be some kind of test code added to XS::APITest by me probably. There is a risk of me or anyone, making a copy paste keyboard error, during the process of turning that "2D matrix" of data, into independent isolated "rows" of C preprocessor "data", with their IDE. I already tried writing regex-es to "parse" that "2D matrix" of data, stored in
Yeah, ^^^ is 10 or fewer CPU opcodes. Splicing the arena linked list takes more opcodes than

It's also super super important for all of libperl's public API newSVxxxxv() functions to be refactored, so all the "writes" done with macros in src code, to head and body fields, can get const folded together by the CC. Not throwing extern C funcs into the newSVxxxxv() function family, in random places, in a random order, in those P5P maintained performance critical SV allocators/ctors.

Another very very low priority thing to do, IMO, to the newSVxxxxv() function family is to refactor them so they alloc/unlink from the arena the SV BODY BEFORE THE SV HEAD, and fill in the SV body before they unlink the head and fill in the head. This way, the newSVxxxxv() function family isn't balancing 2-4 incoming argument variables plus 4-10 C auto variables at the same time, post liveness analysis, in registers/C stack slots, but only balancing a max of 4 (Win64) or 6 (Linux) C variables at any point, or maybe 8 vars at a time (Android ARM32/64 ???), and therefore doesn't need to CTOR and DTOR formal ABI C stack slots in machine code in prolog/epilog, saving maybe 6-14 CPU opcodes in the process. The refactor in this paragraph is very low on my priority list.

(Aside on C lang/C linker arch, not about Perl's C code.) Using registers vs C stack slots has been benchmarked to be 100% identical on x86/x64 for at least 15 years. Athlon 64/Intel Core 2 and newer have 32 or more general purpose registers in real life, not 8/16 registers like the formal ISA.pdf and official Intel/AMD assembly_reference.pdf say. The front end assembly/mach code decoder instantly and silently assigns/maps up to 8/16 different offset reads/writes of C stack slots automatically to physical CPU registers: no memory address, no write back cache, no L1 cache. After 8 C stack slots, the CPU starts using L1 cache lines, with a larger picosecond latency.
Therefore IDC too much about the "fill in the SV body completely before unlinking the SV head and filling it in" optimization. Too many src line changes for too little runtime effect. So very low priority in my book.

In any case x64 SysV/Win64 prologs/epilogs are so "simple" as a byte pattern, b/c of OS ABI reasons, that they are probably a single CPU opcode inside the silicon at runtime anyway. On ARM32/64, entering and leaving a C function call, and following redzone/sig handler OS rules, is ONE SINGLE purpose made, 4 byte long ARM CPU instruction, not (~3-~10)+(~3-~10) ops like on Intel/AMD. Those were some very smart ppl back in 1985 who did this for the ARM platform. Ask C/asm devs what features THEY want, and what common tasks THEY want to do/C compilers often do. Don't design a CPU by first asking a mathematician or RF electrical engineer or industrial welding robot arm firmware engineer what features they want to see from a CPU. A C compiler and C SDK for a car's combustion engine spark plug timings or anti-skid logic has 0% in common with desktop/mobile/UI/business/financial/entertainment/multimedia C compilers (aka Windows/Linux/OSX SDKs). I have successfully written and run C code for the chip on a welding robot arm 2x in my life. The welding robot arm was a K-12 kids' educational toy, sadly, but the SDK/C compiler used was the real one used for the real thing in a car factory. I got a >= 95 on those assignments. Knowing Perl/JS beforehand was super helpful :-)

Currently in blead perl, all the

I've recently tried out a new design algo on libperl.dll that reduces

But there is a severe runtime cost to my optimization, if

I'm going to make a PR with a committable src code example of the above ^^^ optimization somewhere else in the perl VM tonight probably.
I didn't recognize this defect for 15 years until a few months ago, but both Reini's B::CC.pm and Richard, independently came to the same conclusion Perl 5 VM's public/private C API, has a less than perfect or a slightly flawed balance, of what the C API does locally in machine code with Macros/Static Inlines, and what is offloaded to heavy weight full features funcs in libperl.so.dll. This balance can be slowly tweaked by P5P over time without affecting CPAN XS modules, since P5's C API is 100% macros and function calls since 5.000 alpha. P5's public C API, intelligently, from day 1, has never ever publically allowed 3rd party raw direct C struct member/field manipulation. 3rd party people == CPAN XS authors. I never recognized Reini's and Richards opinions myself, since WinPerl's malloc()/file system/ring 0 kernel calling layers are always what comes out "ontop" on every instrumenting/profiling tool I've used on blead perl personally. And Perl's millisecond TTFB response time isn't something that affects my day job's private business Perl code. They use Perl Tk/Win32::GUI for local apps, and some many seconds long, or 3-5 minutes long ETL operations. JSON/protobuf/YAML/Storable serializers aren't used at all. A little bit of data lives in CSV files. Almost all data is accessed through Perl XSUBs, that bind against closed source C .dlls. TIESCALAR and JSON/Sereal/Storable just aren't tools I personally use or ever benchmarked. PS macros click to expand, the C code of Perl 1-Perl 4 and what Perl 1-4 interps did in C src is TLDRPerl 5's C API is much better than other C APIs. Perl 5 VM's API is not MS's Win32 API that can't edit its C structs ever again for any reason for the next 100 years, or POSIX's API, that only allows resorting the memory order/.h src code order, of C struct fields, but the ascii identifiers or their functionality are frozen forever for the next 1000 years. 
POSIX has no concept of fast/slow coding concepts, or plain-old-data dereferences vs high feature, heavy weight, "data properties" with dynamic getter/setter methods.These choices were made unintentionally probably by larry when he typed in Perl 1-Perl 3's C code. Larry probably read the C code for /bin/bash or /bin/csh for inspiration, when he typed Perl 1-4's C code. He definetly didn't read the C src code of /bin/cc and /bin/as for inspiration. Perl 5's C code and architecture looks nothing like the modern internals of GCC/Clang/.NET VM/FF/Safari/Chrome VMs. But someone did retrofit Perl 5 with a SSA/RTL layer that lives in https://github.com/Perl/perl5/blob/blead/op.c which is peculiar, since the logic that lives in https://github.com/Perl/perl5/blob/blead/op.c really should be part of https://github.com/Perl/perl5/blob/blead/perly.h https://github.com/Perl/perl5/blob/blead/perly.c https://github.com/Perl/perl5/blob/blead/perly.act https://github.com/Perl/perl5/blob/blead/perly.tab and machine generated, not hand written optimizing phases. But regardless, Perl >=5.004 ??? has a distinct parse/compile/SSA/RTL/write final machine code (OP* structs) block diagram which is how most/all other languages except for line-by-line languages like basic/bash/csh/ksh and certain SW programing languages created in the 1950s early 1960s, that don't have any compile or parse phase.. click to expand, off topic comp sci talk SO styleThe ASCII as typed executes on silicon HW as if its machine code. 
Not a joke, in my private life I do interact with another entity's AJAX server that ultimately is a front end for this GitHub SW repo and its ancient programming language :-D

In any case, whiteboard talk about further optimizations after fixing the "all 17 types static inline newSV_type() isn't getting CC split into 17 x 1 type tiny newSV_type() variants" bug is better left until after that bug gets fixed once and for eternity in blead, and the final "17 -> 17 x 1 funcs" fix, whatever it looks like, is declared "bug free"/stable after a couple of weeks/months by the Perl community.
I'm currently working on a fix for
Yeah I was thinking along the same lines. A
I had thought about per-type functions originally. Can't remember if someone dissuaded me or I just figured that so many new functions would be unwelcome. If that's an approach that people actually are happy with and bulk88 is already working on it, am happy to leave him to it. I did have a PR nearly ready to either:
Should I continue working on that?
My code is getting rid of that `__FILE__` `__LINE__` `__FUNCTION__` assert()-on bug in stable/non -DDEBUGGING perl builds, which is deeply intertwined with PERL_MEM_LOG/DEBUG_LEAKING_SCALARS and 3-6 different my_getenv() wrapper APIs/layers of C that do the exact same thing, with all 3-6 layers stacked on each other at runtime. This would be my 2nd or 3rd attempt in 6 months at cleaning up WinPerl's anything and everything to do with Kernel32.dll's
Yes if u have the code to slim down Its never going away by name or concept IMO. But I have some ideas how Most/all CPAN XS .dlls on blead import the
No, Perl_more_bodies() is used by various STATIC_INLINEs in perl headers, and therefore all CPAN XS .so/.dlls import it, even though it's up to the CC whether it wants to emit a small static C func, 1 per .so/.dll, or inline the machine code. It's not very hot, and the table logic and Newx() and linked list building need to be inside libperl, not 100s of copies over 100s of shlibs. Perl's arena pools are 100% private API and don't exist from CPAN XS's viewpoint (and can be disabled with an exotic debugging build flag, in which case Newx() is used instead). The absolute byte size of any of perl's C structs/body structs isn't "public api", nor is the CPU paging size, or how many slots/body structs in a row are obtained from
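The split described above, a tiny inlinable fast path that pops a body off a free list plus a cold, out-of-line refill routine that lives in one place, can be sketched roughly as follows. This is not Perl's arena code: `demo_body`, `demo_more_bodies`, `demo_new_body`, the slot count and the use of plain `malloc` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_SLOTS_PER_ARENA 64   /* arbitrary, for illustration */

typedef struct demo_body {
    struct demo_body *next;       /* free-list link while unused */
    long              payload;
} demo_body;

static demo_body *demo_free_root = NULL;  /* head of the free list */
static unsigned   demo_refills   = 0;     /* how often the cold path ran */

#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define DEMO_NOINLINE __declspec(noinline)
#else
#  define DEMO_NOINLINE
#endif

/* Cold path: carve a fresh arena into a singly linked chain of slots.
 * Kept out of line so the malloc call and the loop exist only once.
 * Arenas are never freed in this sketch. */
DEMO_NOINLINE demo_body *demo_more_bodies(void)
{
    demo_body *arena = malloc(sizeof(demo_body) * DEMO_SLOTS_PER_ARENA);
    size_t i;
    if (!arena)
        abort();
    for (i = 0; i + 1 < DEMO_SLOTS_PER_ARENA; i++)
        arena[i].next = &arena[i + 1];
    arena[DEMO_SLOTS_PER_ARENA - 1].next = NULL;
    demo_refills++;
    return arena;                 /* first slot heads the new chain */
}

/* Hot path: a few loads and stores, cheap enough to inline everywhere. */
static inline demo_body *demo_new_body(void)
{
    demo_body *b = demo_free_root ? demo_free_root : demo_more_bodies();
    demo_free_root = b->next;
    b->next = NULL;
    return b;
}

/* Return a body to the front of the free list. */
static inline void demo_del_body(demo_body *b)
{
    b->next = demo_free_root;
    demo_free_root = b;
}
```

With this shape, each shared library that inlines `demo_new_body` only carries the pop logic, while the arena construction stays in a single exported function.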
Only bodyless RV IV and NVs and *AV CTORs on certain macros/code paths, and maybe limited to PERL_CORE, are the only clean inline ctors in my opinion and MSVC -O1's opinion.
Yes, your ideas right now aren't anywhere close, code-wise, to what I'm working on. My unfinished draft is attached if someone is bored.
Statics can't be found by ASCII string name with a text editor after compiling. 17 exported, nearly identically named functions would be annoying to look at in an IDE/nm/dev tool constantly. Most of the SV types are obsolete Perl 4 holdovers, like FORM and IO objects; SVt_INVLIST and SVt_PVOBJ have questionable technical rationale. An array of 17 fn ptrs, all of them normally branch-less/fn-call-less, or 1 exported from libperl dispatcher func with current symbol name

I prefer the 17 static inlines concept more, since it fully matches the original whiteboard problem, which was "allocate 1000 or 10,000 new SVIVs into a

About SVt_INVLIST and SVt_PVOBJ, they probably could be the same type I think. They are both inaccessible/invisible from PP state, not
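For comparison, the two shapes mentioned above (per-type static inline constructors for call sites that know the type at compile time, and one dispatcher backed by an array of function pointers for call sites that do not) might look something like this. Everything here is hypothetical: three made-up types instead of 17, and `dsp_*` names and sizes that do not correspond to anything in perl's source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced type list; NOT Perl's svtype enum. */
typedef enum { DSP_IV, DSP_PV, DSP_AV, DSP_NTYPES } dsp_type;

typedef struct {
    dsp_type type;
    size_t   body_size;   /* made-up sizes, for illustration only */
} dsp_sv;

/* One tiny constructor per type. Each is branch-free, so callers that
 * know the type at compile time can use it directly and have it inlined. */
static inline dsp_sv dsp_new_iv(void) { dsp_sv sv = { DSP_IV, 0 };  return sv; }
static inline dsp_sv dsp_new_pv(void) { dsp_sv sv = { DSP_PV, 24 }; return sv; }
static inline dsp_sv dsp_new_av(void) { dsp_sv sv = { DSP_AV, 40 }; return sv; }

/* Table of constructors, indexed by type, for runtime dispatch. */
typedef dsp_sv (*dsp_ctor)(void);

static const dsp_ctor dsp_ctors[DSP_NTYPES] = {
    [DSP_IV] = dsp_new_iv,
    [DSP_PV] = dsp_new_pv,
    [DSP_AV] = dsp_new_av,
};

/* Single out-of-line entry point for callers whose type is only known
 * at run time: one indirect call instead of a copy of every arm. */
dsp_sv dsp_new_dynamic(dsp_type t)
{
    assert((unsigned)t < DSP_NTYPES);
    return dsp_ctors[t]();
}
```

The trade-off in this sketch is that constant-type callers pay nothing for the other types, while dynamic callers pay one indirect call through the table.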
IO is very much in active use:
I have used formats, but it's been quite a long time now.
`Perl_newSV` is a good function to disassemble to check whether the `Perl_newSV_type` call is inlined or not.