
Try to force inlining of newSV_type (i -> I in embed.fnc) #23190


Open
wants to merge 1 commit into
base: blead

richardleach
Contributor

When Perl_newSV_type became an inline function, the idea was that using it to create a specific type known at compile time should result in the call being completely inlined into the call site.

So something like this would always be inlined by gcc/clang under default build settings:

SV* mySV = newSV_type(SVt_PV)

At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.

This commit changes the inline flag in embed.fnc from i ("please try to inline") to I ("always inline", where supported). This restores the intended behaviour.

The more aggressive inlining flag wasn't originally specified out of caution, in case it caused the perl binary size to grow excessively. On the present codebase, though, building with this flag and gcc 12 actually resulted in the binary size shrinking by 312 bytes.

Perl_newSV is a good function to disassemble to check if the Perl_newSV_type call
is inlined or not.
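
To illustrate what the `i` to `I` change is after, here is a minimal sketch, not perl's actual code. The names `toy_type`, `toy_body_size`, `toy_pv_size`, and `TOY_FORCE_INLINE` are hypothetical, and the macro assumes the GCC/Clang `always_inline` attribute or MSVC's `__forceinline`. When the type argument is a compile-time constant and the call is inlined, the optimizer has the chance to collapse the `switch` to a single branch at the call site.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the "I" flag: request forced inlining where
 * the compiler supports it, otherwise fall back to a plain inline hint. */
#if defined(__GNUC__) || defined(__clang__)
#  define TOY_FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define TOY_FORCE_INLINE static __forceinline
#else
#  define TOY_FORCE_INLINE static inline
#endif

/* Toy type tags; this is not perl's svtype enum. */
typedef enum { TOY_NULL, TOY_IV, TOY_PV, TOY_AV } toy_type;

/* General routine that handles every type. The sizes are made up. */
TOY_FORCE_INLINE size_t toy_body_size(toy_type t)
{
    switch (t) {
    case TOY_IV: return sizeof(long);
    case TOY_PV: return 3 * sizeof(void *);
    case TOY_AV: return 5 * sizeof(void *);
    default:     return 0;
    }
}

/* A caller passing a constant: once toy_body_size() is inlined here,
 * only the TOY_PV branch needs to survive in the generated code. */
static size_t toy_pv_size(void)
{
    return toy_body_size(TOY_PV);
}

/* Self-check that the constant call gives the TOY_PV result. */
static void toy_self_check(void)
{
    assert(toy_pv_size() == 3 * sizeof(void *));
}
```

Disassembling `toy_pv_size` in an optimized build is the same kind of check as the one suggested for `Perl_newSV` above.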


  • This set of changes does not require a perldelta entry.

@jkeenan
Contributor

jkeenan commented Apr 10, 2025

> When Perl_newSV_type became an inline function, the idea was that using it to create a specific type known at compile time should result in the call being completely inlined into the call site.
>
> So something like this would always be inlined by gcc/clang under default build settings:
>
>     SV* mySV = newSV_type(SVt_PV)
>
> At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.

Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?

@richardleach
Contributor Author

> > At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.
>
> Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?

Not sure off the top of my head.

I wondered if it could have been 24c3369 but won't get time to do a before and after build before the weekend.

@jkeenan
Contributor

jkeenan commented Apr 10, 2025

> > > At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.
>
> > Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?
>
> Not sure off the top of my head.

Okay, let me re-phrase. How would we have previously known that inlining was happening -- and subsequently no longer was happening?

> I wondered if it could have been 24c3369 but won't get time to do a before and after build before the weekend.

@leonerd @tonycoz can you take a look? Thanks.

@bulk88
Contributor

bulk88 commented Apr 11, 2025

between

SHA-1: 7e2ed756ebc5b1c55afe3e6ba5be424db74ef6e4
* Try to force inlining of newSV_type (i -> I in embed.fnc)

and blead's

SHA-1: 1b6aef6d141ef8e718ff72add8ed8653f2c7847f
* Update ExtUtils-MakeMaker to CPAN version 7.74

perl541.dll before: 3.06 MB (3,210,240 bytes), 3,135 KB
perl541.dll after:  3.07 MB (3,226,624 bytes), 3,151 KB

With MSVC 2022 x64 -O1, the commit in this PR exposed (or created) another bug that had been hiding. Before this commit, the byte sequence/C string "C:\\sources\\perl5\\sv_inline.h" didn't exist in my libperl.dll.

After this commit I now have 15 call sites in perl541.dll that look like this:

sv = Perl_new_sv(my_perl, "C:\\sources\\perl5\\sv_inline.h", 378, "Perl_newSV_type");

This is caused by the commit named below, which should be reverted. Equivalently, for perl builds without -DDEBUGGING, restore the old non-static, non-inline macro code that disappeared in that commit:

c79fe2b

The commit in this PR changed the decision of MSVC's inline cost analyzer/tree walker: it now chooses NOT to inline the non-exported symbol Perl_new_sv(), whereas previously it always inlined Perl_new_sv() away, together with its accidental assert-style arguments that are live in production builds. Once this commit bumped the inline priority of the static symbol Perl_newSV_type(), which is now embedded in many more places (still researching details), Perl_new_sv() went from being inlined away to being a separate C function call everywhere, with 3 useless arguments passed by all callers and ignored by the body of Perl_new_sv().
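
As a rough illustration of the macro approach mentioned above, the sketch below keeps the location arguments out of non-debug builds entirely, so no caller has to pass a file name, line number, or function name. It is not the code that c79fe2b removed; `TOY_DEBUG_LEAKS`, `toy_obj`, `toy_new_obj`, and the other names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_obj {
    int refcnt;
#ifdef TOY_DEBUG_LEAKS
    const char *file;   /* allocation site, debug builds only */
    int line;
#endif
} toy_obj;

#ifdef TOY_DEBUG_LEAKS
/* Debug build: record where the object was allocated. */
static toy_obj *toy_new_obj_at(const char *file, int line)
{
    toy_obj *o = calloc(1, sizeof *o);
    if (o) { o->refcnt = 1; o->file = file; o->line = line; }
    return o;
}
#  define toy_new_obj() toy_new_obj_at(__FILE__, __LINE__)
#else
/* Production build: the location arguments do not exist at all, so the
 * compiler never has to materialize a path string at each call site. */
static toy_obj *toy_new_obj_plain(void)
{
    toy_obj *o = calloc(1, sizeof *o);
    if (o) o->refcnt = 1;
    return o;
}
#  define toy_new_obj() toy_new_obj_plain()
#endif

/* Example use: allocate one object, read its refcount, release it. */
static int toy_demo(void)
{
    toy_obj *o = toy_new_obj();
    int rc;
    assert(o != NULL);
    rc = o->refcnt;
    free(o);
    return rc;
}
```

With this shape the decision does not depend on whether a particular compiler happens to inline the allocator and discard its unused arguments.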

@bulk88
Contributor

bulk88 commented Apr 11, 2025

--- pb4164vc22rlb4/t.txt        2025-04-10 20:21:00.483738900 -0400
+++ pb4164vc22rlaf/t.txt        2025-04-10 20:17:25.172423800 -0400
@@ -4,7 +4,7 @@
  Directory of C:\pb4164vc22rlaf\bin

 perl541.dll
-               1 File(s)      3,210,240 bytes
+               1 File(s)      3,226,624 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\attributes

@@ -14,7 +14,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\B

 B.dll
-               1 File(s)         81,920 bytes
+               1 File(s)         80,896 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Compress\Raw\Bzip2

@@ -34,12 +34,12 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Data\Dumper

 Dumper.dll
-               1 File(s)         32,256 bytes
+               1 File(s)         32,768 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Devel\Peek

 Peek.dll
-               1 File(s)         18,944 bytes
+               1 File(s)         18,432 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Digest\MD5

@@ -119,7 +119,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Hash\Util

 Util.dll
-               1 File(s)         22,528 bytes
+               1 File(s)         21,504 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Hash\Util\FieldHash

@@ -159,7 +159,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Opcode

 Opcode.dll
-               1 File(s)         21,504 bytes
+               1 File(s)         20,992 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\PerlIO\encoding

@@ -179,12 +179,12 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\POSIX

 POSIX.dll
-               1 File(s)         76,288 bytes
+               1 File(s)         75,776 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\re

 re.dll
-               1 File(s)        640,000 bytes
+               1 File(s)        641,024 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\SDBM_File

@@ -199,7 +199,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Storable

 Storable.dll
-               1 File(s)         93,184 bytes
+               1 File(s)         97,280 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Sys\Hostname

@@ -224,7 +224,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Time\Piece

 Piece.dll
-               1 File(s)         23,552 bytes
+               1 File(s)         24,064 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Unicode\Collate

@@ -239,7 +239,7 @@
  Directory of C:\pb4164vc22rlaf\lib\auto\Win32

 Win32.dll
-               1 File(s)         48,128 bytes
+               1 File(s)         47,616 bytes

  Directory of C:\pb4164vc22rlaf\lib\auto\Win32API\File

@@ -247,5 +247,5 @@
                1 File(s)         64,512 bytes

      Total Files Listed:
-              49 File(s)     11,160,576 bytes
-               0 Dir(s)  182,723,043,328 bytes free
+              49 File(s)     11,179,008 bytes
+               0 Dir(s)  182,725,914,624 bytes free

The wins and losses look random from file to file because they depend on how many .o/.obj files, each with its own static-visibility newSV_type, were linked into one .dll. Also, one full-sized newSV_type handling all 17 SV types is replaced by, I don't know, 3-10 tiny instances, either as real C functions with a formal stack frame or inlined into many places. All the callers (CPAN XS, core XS, or core interpreter C) use the basic everyday RC types (bodyless SV/RV, or just PV/PVMG), so those blobs of machine code are fractions of the full-featured "all 17 types" newSV_type.

Before this commit, my libperl had 23 copies of the "all 17 types" newSV_type, 0 copies of Perl_new_sv(), 0 copies of S_new_body(), and 50 copies (at unique addresses) of the bodies_by_type const static struct.

Afterwards: 0 copies of newSV_type, 1 copy of Perl_new_sv(), 3 copies of S_new_body(), and 50 copies (at unique addresses) of the bodies_by_type const static struct.

Perl_forbid_outofblock_ops
before 0xd2
after 0x20d

Perl_vivify_ref
before 0x187
after 0x316

Rough math: (0x316-0x187)/2 = about 200 bytes for each "1 of 17" newSV_type that is inlined and that also does the arena body free-list unlinking/allocating.
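
The rough math above can be spelled out in a few lines of C. The function sizes are the ones quoted in the comment; the divisor of 2 follows the comment's own formula (two inlined copies in Perl_vivify_ref), and the helper name `growth_per_copy` is made up for this sketch.

```c
#include <assert.h>

/* Function sizes in bytes, as quoted in the comment above. */
enum {
    VIVIFY_REF_BEFORE = 0x187,  /* 391 */
    VIVIFY_REF_AFTER  = 0x316,  /* 790 */
    FORBID_OPS_BEFORE = 0xd2,   /* 210 */
    FORBID_OPS_AFTER  = 0x20d   /* 525 */
};

/* Growth of one function, split across an assumed number of inlined
 * copies. Integer division, so the result is rounded down. */
static int growth_per_copy(int before, int after, int copies)
{
    return (after - before) / copies;
}

static void check_rough_math(void)
{
    /* (790 - 391) / 2 = 199, i.e. roughly 200 bytes per inlined copy. */
    assert(growth_per_copy(VIVIFY_REF_BEFORE, VIVIFY_REF_AFTER, 2) == 199);
    /* Perl_forbid_outofblock_ops grew by 525 - 210 = 315 bytes in total. */
    assert(growth_per_copy(FORBID_OPS_BEFORE, FORBID_OPS_AFTER, 1) == 315);
}
```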

@bulk88
Contributor

bulk88 commented Apr 11, 2025

> > At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.
>
> Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?

LTO is one of the least dependable, least portable, and most "undefined" parts of any brand of C compiler. LTO as a feature doesn't exist outside of FAANG-sponsored C compilers (Clang/Apple, MSVC, GCC [partially]). So for anyone with a commercial Unix OS, or Tiny C: tough luck, 25-50 copies of the "all 17 types" newSV_type() in your libperl.

A compiler's/linker's LTO byte stream output is stable for exactly 1 build number of that C compiler. Security people only care about "binary reproducible" with today's C compiler/toolchain, not yesterday's and not tomorrow's nightly build of a C compiler. That security concern is more about the final binary, or a large group of final binaries, not being a back-channel data exfiltration path between 1 person's/dev's personal box/build farm VMs and the general public.

https://www.eff.org/fr/pages/list-printers-which-do-or-do-not-display-tracking-dots

LTO isn't even supposed to exist on ELF systems in a .so file, because of the ELF symbol interposition rule. Only the root process start-up binary on ELF can be LTOed, according to the tech specifications.

LTO is always off by default in all C compilers, any brand. The bytecode .o/.a/.lib files created during an LTO compile can't be archived or distributed publicly (FOSS or clients). LTO bytecode inside a .o has no forwards/backwards compatibility except with the same build number of the C compiler that made it.

Perl only uses LTO b/c someone at P5P wrote and pushed to blead a patch for their favorite brand of C compiler. And that patch uses a 100% proprietary non-portable vendor specific flag to turn LTO on.

If LTO works on your CC/OS/build env, awesome. Good for you and everyone else with the same platform permutation. But depending on LTO to fix BAD CODE and coding mistakes, Nope. I disagree with that. An optimization of a CC can change or come and go or break and get fixed 3 years later. Never write code assuming LTO/tail calling optimization WILL happen or else. Don't fix infinite stack recursion with -O2/-O3.

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

Exactly which opcodes/bytes the same brand of C compiler will emit 9-18 months from now is totally random. Each new Intel/AMD CPU die/microarchitecture (every 18 months) has micro-benchmarkable differences for all its basic/general-purpose opcodes. Some opcodes get slower, some get faster; it's random and not connected to MHz or the machine code being benchmarked.

GCC and Clang are facing social drama about de-optimizing, that is, about making Pentium 3 and Pentium 4 CPUs intentionally slow out of the box to improve speeds for >= Core 2, or de-optimizing for Core 2 or older Pentium 4, but without "removing" or "breaking" compatibility with those very old CPUs. It's not a SIGILL. It's just de-optimizing. A build flag exists to get back Pentium 3/P4 mode if an end user really wants the maximum speed back.

There are other tickets I've opened to fix this bug and clean up this API so that it is correctly const-foldable and doesn't depend on sketchy permutations of LTO + OS kernel specifics + the C compiler author's random decisions going right.

I have mixed feelings about this commit: it made libperl much bigger, but it makes smallish, typical CPAN .dlls randomly bigger or smaller than before.

A not-that-hard test could be written to check libperl and all the separate XS .dlls/.so files for the
byte pattern 10h, 10h, 10h, 0C3h, 0DD0. If that byte array is seen more than once in libperl.so or in some random XS module, it means this newSV() optimization failed on that particular CC.
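
A minimal sketch of such a check, assuming the binary has already been read into memory. The function names are hypothetical, and the signature bytes are passed in as a parameter rather than hardcoded, since the exact pattern would have to be confirmed per compiler and platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count occurrences (overlapping ones included) of pat[0..pat_len)
 * inside buf[0..buf_len). */
static size_t count_pattern(const unsigned char *buf, size_t buf_len,
                            const unsigned char *pat, size_t pat_len)
{
    size_t count = 0;
    size_t i;

    if (pat_len == 0 || pat_len > buf_len)
        return 0;
    for (i = 0; i + pat_len <= buf_len; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0)
            count++;
    }
    return count;
}

/* The proposed test: more than one hit means more than one full copy
 * of the signature-bearing data ended up in the binary image. */
static int has_duplicate_copies(const unsigned char *image, size_t image_len,
                                const unsigned char *sig, size_t sig_len)
{
    return count_pattern(image, image_len, sig, sig_len) > 1;
}

/* Self-check on a tiny made-up image with two copies of a made-up signature. */
static void pattern_self_check(void)
{
    static const unsigned char image[] = {
        0x10, 0x10, 0x10, 0xC3, 0x00, 0x10, 0x10, 0x10, 0xC3
    };
    static const unsigned char sig[] = { 0x10, 0x10, 0x10, 0xC3 };

    assert(count_pattern(image, sizeof image, sig, sizeof sig) == 2);
    assert(has_duplicate_copies(image, sizeof image, sig, sizeof sig));
}
```

A real test would load each .dll/.so from disk and would need a signature long enough to avoid accidental matches.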

@tonycoz
Contributor

tonycoz commented Apr 11, 2025

> > When Perl_newSV_type became an inline function, the idea was that using it to create a specific type known at compile time should result in the call being completely inlined into the call site.
> > So something like this would always be inlined by gcc/clang under default build settings:
> >
> >     SV* mySV = newSV_type(SVt_PV)
> >
> > At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.
>
> Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?

I did a build of recent blead with clang-21 with -fsave-optimization-record and ran my analysis script against the result.

Perl_newSV_type  75.8% 369/487
Perl_newSV_type_mortal  100.0% 96/96

So it does appear to be inlined most of the time.

I'm more worried about whether the compiler optimizes based on the type of the resulting SV, ie. does it optimize away a type check for SVt_PVAV if the SV was created with newSV_type(SVt_PVAV)?

The other problem we've had with forced inline is build errors from the -Og optimization option with gcc.

@bulk88
Contributor

bulk88 commented Apr 11, 2025

> > > When Perl_newSV_type became an inline function, the idea was that using it to create a specific type known at compile time should result in the call being completely inlined into the call site.
> > > So something like this would always be inlined by gcc/clang under default build settings:
> > >
> > >     SV* mySV = newSV_type(SVt_PV)
> > >
> > > At some point in the past couple of dev cycles, this inlining seems to have stopped happening. Possibly additions just tipped it over a size threshold within the C compiler optimization passes.
>
> > Would that point be bisectable? If so, is there a one-liner that would trigger the change in behavior?
>
> I did a build of recent blead with clang-21 with -fsave-optimization-record and ran my analysis script against the result.
>
>     Perl_newSV_type  75.8% 369/487
>     Perl_newSV_type_mortal  100.0% 96/96
>
> So it does appear to be inlined most of the time.
>
> I'm more worried about whether the compiler optimizes based on the type of the resulting SV, ie. does it optimize away a type check for SVt_PVAV if the SV was created with newSV_type(SVt_PVAV)?
>
> The other problem we've had with forced inline is build errors from the -Og optimization option with gcc.

And -O2 on recent/modern (late 2010s/2020s) MinGW GCCs for Win32/Win64 was, and currently still is, causing an unfixable SEGV that nobody can diagnose, so stable perl uses -O1 on Windows GCCs until further notice. I probably could fix whatever the assembly code/code generation bug is, or find the active open GCC ticket with the exact bug in GCC, or figure out what magic --fno-very-long-special-alt-compliance-mode-gcc-only-flag fixes it, but I never tried because of time limitations. IIRC Perl + GCC -O3 is impossible: https://stackoverflow.com/questions/2958633/gcc-strict-aliasing-and-horror-stories

The permanent fix for newSV_type is documented in stalled ticket #22667

17 individual static decl-ed C symbols for 17 SV types need to exist at .i stage. The fix must be at a .c/.i level. Nothing else will fix the problem. -O1-4, use nightly clang, or use an approved [my] platform are incorrect tools to use to solve a defect for public SW.

I don't want to touch the ticket or write a blead/stable (same thing nowadays) patch for #22667 until a bunch of other small "break-out" "support" PRs get merged or approved into blead like
#22880 and #22662

The "final" patch for #22667 WILL cause a small amount of BBC breakage, since the 1 and only fix, involves doing this

#define newSV_type(_type) S_newSV_type_##_type(aTHX)

and adding a new no-static, no-inline, exported from libperl Perl_newSV_typex(pTHX) or Perl_newXV_type(pTHX) function that handles all 17 types.
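
A toy sketch of how that token-pasting dispatch could look, under heavy assumptions: the `toy_` names, the three-member enum, and the body sizes are all hypothetical, and the real proposal in #22667 may differ. The point it shows is that the macro only accepts a literal type name, while run-time values go through the general function.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy type tags; perl's real svtype enum has more members. */
typedef enum { TOY_IV, TOY_PV, TOY_AV, TOY_TYPE_COUNT } toy_type;

typedef struct { toy_type type; size_t body_size; } toy_sv;

/* General, non-inline allocator: handles any type chosen at run time
 * (the role the comment gives to Perl_newXV_type). */
toy_sv *toy_newXV_type(toy_type t)
{
    static const size_t sizes[TOY_TYPE_COUNT] = { 8, 24, 40 };
    toy_sv *sv;

    if ((unsigned)t >= TOY_TYPE_COUNT)
        return NULL;
    sv = malloc(sizeof *sv);
    if (sv) { sv->type = t; sv->body_size = sizes[t]; }
    return sv;
}

/* One tiny static allocator per type, generated by the preprocessor,
 * so no optimizer cooperation is needed to specialize by type. */
#define TOY_DEFINE_ALLOCATOR(tag, size)                     \
    static toy_sv *S_toy_new_##tag(void)                    \
    {                                                       \
        toy_sv *sv = malloc(sizeof *sv);                    \
        if (sv) { sv->type = tag; sv->body_size = size; }   \
        return sv;                                          \
    }

TOY_DEFINE_ALLOCATOR(TOY_IV, 8)
TOY_DEFINE_ALLOCATOR(TOY_PV, 24)
TOY_DEFINE_ALLOCATOR(TOY_AV, 40)

/* Dispatch by token pasting: toy_newSV_type(TOY_PV) expands to
 * S_toy_new_TOY_PV(). Passing a variable would paste its name instead
 * and fail to compile. */
#define toy_newSV_type(tag) S_toy_new_##tag()

/* Self-check: both paths agree for the same type. */
static void toy_dispatch_check(void)
{
    toy_sv *a = toy_newSV_type(TOY_PV);
    toy_sv *b = toy_newXV_type(TOY_PV);
    assert(a && b && a->type == b->type && a->body_size == b->body_size);
    free(a);
    free(b);
}
```

The compile error on a non-constant argument is the source-level breakage discussed in the following paragraphs.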

IIRC there are 2 statements currently in blead perl's repo that do sv = newSV_type(c_auto_var_u32_or_u8);. Just 2 statements. The moment sv = newSV_type(c_auto_var_u32_or_u8); is written, the C compiler must output in machine code the "all 17 types" variant of static inline newSV_type(), and hopefully it ignores the inline tag that the C dev/P5P wrote in the source code.

The BBC breakage, i.e. the slight change to the public API docs of newSV_type() so that it only takes an "enum" or "constant" and not a 0-2^32 range U32 integer, is super easy to fix in the 1% or less of modules that will be broken. Just switch newSV_type(u8_type_var); to newXV_type(u8_type_var); and recompile/reship the XS module, or write

sv = u8_type == SVt_PVHV ? newSV_type(SVt_PVHV) : u8_type == SVt_PVAV ? newSV_type(SVt_PVAV) : (croak("This de-serializer sub in this serialization CPAN XS module was fed an invalid data structure filled with corrupt bytes"),NULL);

CPAN XS modules that are doing newSV_type(u8_var) are probably data serializers and they have no business, and can't possibly be wanting to inflate types SVt_PVCV, SVt_PVFM, SVt_PVIO, SVt_PVOBJ, etc, from a 1 byte U8 that originated from stdio/a disk file.

@richardleach added the "do not merge" label (Don't merge this PR, at least for now) on Apr 11, 2025
@richardleach
Contributor Author

Thanks for the feedback, all.

It's good to know that inlining is still happening with clang-21. (I really think the performance improvement is worth it.) Maybe my distro's compilers are just comparatively long in the tooth.

I won't pursue this PR further over the gcc -Og problems.

> I'm more worried about whether the compiler optimizes based on the type of the resulting SV, ie. does it optimize away a type check for SVt_PVAV if the SV was created with newSV_type(SVt_PVAV)?

I'll try to do some experiments to see, though presumably that could vary between compilers and versions. I'd like to imagine that:

  • If the call is inlined, and nothing happens in between the call and type check, the check would be optimized away. (Compiler will see "SvTYPE(sv) = SVt_whatever", followed by "if (SvTYPE(sv) == SVt_whatever)".)
  • If the call isn't inlined, or there's e.g. some function call between SV creation and type check, the compiler plays it safe.

But with optimising compilers, who knows!
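
The two cases in the list above can be set up as a small experiment like the following sketch. All names are hypothetical and the struct is a toy, not perl's SV. Whether the first check is actually removed depends on the compiler and flags, so the generated assembly would need to be inspected to confirm it.

```c
#include <assert.h>

typedef enum { TOY_IV, TOY_PV, TOY_AV } toy_type;
typedef struct { toy_type type; int fill; } toy_sv;

/* Inline initializer: the store to sv->type is visible to the
 * optimizer at every call site. */
static inline void toy_init(toy_sv *sv, toy_type t)
{
    sv->type = t;
    sv->fill = -1;
}

/* Stands in for a function in another translation unit. If the compiler
 * cannot see its body, it must assume the call may modify *sv. */
void toy_opaque(toy_sv *sv);

/* Case 1: nothing happens between creation and check, so the compiler
 * can know sv.type == TOY_AV and may fold the test away. */
int toy_check_immediately(void)
{
    toy_sv sv;
    toy_init(&sv, TOY_AV);
    if (sv.type != TOY_AV)
        return 0;
    return 1;
}

/* Case 2: a call receives the address of sv before the check, so the
 * type has to be reloaded unless the callee's body is known. */
int toy_check_after_call(void)
{
    toy_sv sv;
    toy_init(&sv, TOY_AV);
    toy_opaque(&sv);
    if (sv.type != TOY_AV)
        return 0;
    return 1;
}

/* Defined here only so the sketch links; move it to a separate file to
 * make it truly opaque to the optimizer. */
void toy_opaque(toy_sv *sv) { (void)sv; }

static void toy_type_check_demo(void)
{
    assert(toy_check_immediately() == 1);
    assert(toy_check_after_call() == 1);
}
```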

c79fe2b (which IIRC had follow-ups) was in response to reasonable feedback received about the naming of new_sv. I'm not sure that simple reversion is going to happen.

I'm not clear that #22667 is the right or best approach. For example, what if I did the following, with clear comments?

  • Move the body_details tables back into sv.c and mark it extern
  • Hardcode the relevant values into newSV_type - but on DEBUGGING builds there is an assert statement that each of the hardcoded values matches what's actually in the relevant table.
  • Move Perl_new_sv back into sv.c as S_new_sv (as it was), have DEBUG_LEAKING_SCALARS call into that function, inline the non-debugging bits into newSV_type.
    if (PL_sv_root)
        uproot_SV(sv);
    else
        sv = Perl_more_sv(aTHX);
    SvANY(sv) = 0;
    SvREFCNT(sv) = 1;
    SvFLAGS(sv) = 0;

It seems like that could address the main bloat concerns raised by bulk88.
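
A minimal sketch of the "hardcode the value, assert it against the table on debugging builds" idea from the list above. The table, macro, and function names are hypothetical, `TOY_DEBUGGING` stands in for perl's DEBUGGING define, and the sizes are made up.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_IV, TOY_PV, TOY_AV, TOY_TYPE_COUNT } toy_type;

struct toy_body_details { size_t body_size; };

/* In the proposal this table would live in a single .c file and be
 * declared extern in a header; it is defined here so the sketch is
 * self-contained. */
const struct toy_body_details toy_bodies_by_type[TOY_TYPE_COUNT] = {
    { 8 }, { 24 }, { 40 }
};

/* Hardcoded copy of the table entry used by the inline fast path. */
#define TOY_PV_BODY_SIZE 24

static inline size_t toy_pv_body_size(void)
{
#ifdef TOY_DEBUGGING
    /* Debugging builds verify that the hardcoded value has not drifted
     * away from the authoritative table. */
    assert(toy_bodies_by_type[TOY_PV].body_size == TOY_PV_BODY_SIZE);
#endif
    /* Non-debugging builds never touch the table, so the inline function
     * does not pull a private copy of it into each object file. */
    return TOY_PV_BODY_SIZE;
}
```

The cost of this design is that every hardcoded value has to be kept in sync by hand, with the debugging-build assert as the safety net.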

@tonycoz
Contributor

tonycoz commented Apr 11, 2025

> It's good to know that inlining is still happening with clang-21. (I really think the performance improvement is worth it.) Maybe my distro's compilers are just comparatively long in the tooth.

You can get similar info from gcc with -fopt-info-inline-all, but it's output as compilation warnings.

In my casual testing I didn't find any misses, but I just did a full run and found a few missed cases:

perl.c:2576:26: missed:   not inlinable: S_parse_body.isra/860 -> Perl_newSV_type/256, --param max-inline-insns-single limit reached
perl.c:2584:30: missed:   not inlinable: S_parse_body.isra/860 -> Perl_newSV_type/256, --param max-inline-insns-single limit reached
perl.c:4750:21: missed:   not inlinable: S_init_postdump_symbols/340 -> Perl_newSV_type/256, --param max-inline-insns-single limit reached
perl.c:4748:20: missed:   not inlinable: S_init_postdump_symbols/340 -> Perl_newSV_type/256, --param max-inline-insns-single limit reached
op.c:10717:18: missed:   not inlinable: Perl_newMYSUB/459 -> Perl_newSV_type/257, --param max-inline-insns-single limit reached
op.c:12025:18: missed:   not inlinable: Perl_newXS_len_flags/468 -> Perl_newSV_type/257, --param max-inline-insns-single limit reached
op.c:12083:14: missed:   not inlinable: Perl_newSTUB/469 -> Perl_newSV_type/257, --param max-inline-insns-single limit reached
gv.c:97:18: missed:   not inlinable: Perl_gv_add_by_type/293 -> Perl_newSV_type/256, --param max-inline-insns-single limit reached

Rough stats:

$ grep 'missed.* -> Perl_newSV_type/' build.txt | wc -l
97
$ grep 'Inlined Perl_newSV_type/' build.txt | wc -l
357

which is 78%-ish inlined.

@bulk88
Contributor

bulk88 commented Apr 12, 2025

> Thanks for the feedback, all.
>
> It's good to know that inlining is still happening with clang-21. (I really think the performance improvement is worth it.)

The original concept/whiteboard idea of splitting the mega "all 17 types" SV allocator into 17 tiny separate allocators is an excellent idea IMO. The optimization just needs to be implemented differently, source-code-wise, than it currently is in stable/blead perl. The implementation needs to be done at the regen.pl stage or the C preprocessor stage, not by relying on "undefined behavior" inside Foo Vendor's C linker's LTO engine to do extreme/dangerous/not ISO C compliant optimizations.

The current blead/stable perl implementation assumes that all C compilers will perform https://en.wikipedia.org/wiki/Loop-invariant_code_motion between 2 different linker symbols (C function bodies). In other words, it assumes they will do loop-invariant code motion through the OS/CPU's ABI standard and through the official C language declaration of a C function's prototype in a .o file, and will disobey what the developer typed in source code as the static function's prototype.

MSVC refuses to "slide to the left" the arguments on the right side of an unused argument (unused in the body of a static C function). Instead, MSVC just leaves a hole: it leaves the CPU register with uninitialized/unknown content in the callers of the static function (it won't emit the asm opcode to move the int constant written in the C source into the register), and the static function never reads the unused volatile CPU register representing unused incoming argument #1 or #3 before reusing that register for other things.

IDK what GCC or WinClang GCC or WinClang MSVC will do in the same situation. Will they slide to the left all remaining right-side variables when LTO optimizes out an unused incoming C stack (CPU register) argument of a C static function, between it and its callers?

Some random reasons I've guessed in my head why C prototype args can't be "slid to the left" by a C compiler, even with LTO:

  • the C prototypes of static functions are always turned into hard typing C++ mangled names, then hashed, the C/C++ linker must always see/know the pre-LTO/pre-optimization C++ mangled name of a C lang static function, in order to dedupe C static functions in the final binary. A C++ mangled

  • Win64 for x64's C/C++/any language ABI is 90% a straight up copy paste from the Unix SysV .pdf specification, which uses static unwind or static description tables to unwind the C stack for C++ try/catch exceptions, and where and how to execute user mode C lang POSIX signal handlers (or WinNT APCs) on user mode OS threads and user mode C stacks. See https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build and https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html

  • Doing a "shift to the left" causes weirdness/complexity in modern parallel multicore C compilers. The body of the static function, the body of the caller of the static function, and the C structs/array unwind table for the static were generated in parallel on 3 different cores. It's probably easier to leave the C variables uninitialized and drop out the =/mov ops at the very, very end of the block diagram of LTO inside the linker. The very last step is converting the C structs which represent the RTL/SSA IR bytecode to x64 machine code, which is a very simplistic O(1) algorithm that doesn't look forwards or backwards through "global state" to do its job. Recomputing the line-by-line C source "hit counts" on C autos, and redoing variable liveness analysis in the C static function body and ALL its callers, to do a "shift to the left" is much more complicated than telling the assembler stage in the C LTO linker to NOOP/drop out the mov instruction in the caller frames once it's "discovered", at the end of the block diagram, that the callee C static function has an unused incoming argument.

  • "shift to the left" will cause UI craziness in GDB/MSVC C debuggers. Instead of "optimized out", you won't see the variable at all, and you will think there is a C source/C preprocessor macro bug you created that you need to fix. This UI craziness would be a limitation of the DWARF/PDB C symbol file debugger format. The C structs don't allow -1 for "C stack argument position" to indicate a variable that is optimized away; they only allow the C struct fields "start line/start CPU op" and "end line/end CPU op" to be used to indicate an optimized-out variable. You still have to fill in the "C stack argument position" field, and you can't use constant -1 or 0 for hysterical reasons.

  • MS decided that the purity/validity/usefulness of C debugger backtraces in crashes/SEGVs, without .pdb files available for certain parts of the C stack and certain C call frames, is of much, much higher priority than allowing MSVC or any Win64 CC to do "__randomcall" calling conventions on its brand new x64 Windows platform (Server 2003 day 1 release). MSVC 2003 and newer, in i386 mode, with -O1 and LTO turned on, does 100% perfect amazing witchcraft voodoo in its LTO engine regarding calling conventions, which registers are volatile/non-volatile, and in what order ISO C args are privately passed between 2 C functions that qualified for LTO because they are not exported symbols. MSVC in i386 mode has an amazing, near-perfect LTO layer. That i386-only LTO engine and what it does couldn't be transferred to MSVC running in x64/Win64, and MSVC for x64's LTO engine is very simplistic. IDK why, from a comp sci/engineering view.

Clang/Intel C definitely, maybe GCC, running with 80x25 chars of proprietary command line flags, will do some amazing LTO, just like i386 MSVC does, but that's a huge maintenance investment to turn on from the P5P side, and to force those proprietary command line flags from clang/gcc/Intel C onto EU::MM users and CPAN XS authors. It's a large maintenance investment, because

  • P5P/metaconfig supports ancient GCCs/FooUnixOS CCs
  • Most P5P C devs aren't also MPEG-4 video decoder assembly devs
  • suddenly flipping a process-global CC command line flag that is non-selective at the .c source level, and that causes massive CC compile-time errors or runtime SEGV breakage for CPAN XS authors talking to their non-perl-aware 3rd party C libraries, is unacceptable.
  • Trying to be Rust/Zig that don't link against GNU's/FSF's Glibc library and only directly talk to the linux kernel through its public API. That is impossible in perl ecosystem to pull off with 100% volunteers running P5P and without >= 7 digits $$$ US dollar budget. Zig's devs code directly against ntdll.dll's Nt* functions in their STDIO layer. Note, ntdll.dll's Nt* functions are exactly identical to coding against Linux's https://github.com/torvalds/linux/blob/v4.17/arch/x86/entry/syscalls/syscall_64.tbl#L11 public api.
  • Remember P5 is written in, in theory, in both the C99 and C++ languages, and must link/interop with all other normal C89/C++ code made by 3rd parties.

I would vote AGAINST any "optimization" patch/PR that suddenly lands in the P5P bug queue where the optimization "messes up" address space so much inside a perl.exe process that the perl interpreter executes ALL CPAN XS authors' C99 XSUBs in a 2nd perl_xs_ldr.exe process through a unix domain socket, because that new "optimization" deallocated the OS-provided C stack for some reason and requires unmapping glibc.so/kernel32.dll from the perl process for some reason. Just because a web browser does that isn't a reason for Perl 5 to do it.

  • Adding "recompile Grub and the Linux Kernel and the whole Debian .iso from src code" as part of Perl's README is def out of scope for P5P LOL. P5P isn't Google or Apple who compile their own OS and their own BIOS/EFI images from src code.

(Aside: this is about C language/C linker architecture, and isn't about Perl's C code.)

Also, a possible reason MS forbade custom ABIs on Win64 is that on Win32 for i386, C backtraces are 100% impossible to get for a frozen process or a crash memory dump file unless you have .pdb files for all (every last one) of the .dlls in virtual address space (tough luck if you have a commercial closed-source non-MS .dll in your perl.exe address space). On Windows for i386, for C++ exceptions/C89 exceptions (SEH), each frame must "register" a real extern C89ish C function with arbitrary machine code with the Windows kernel, describing how to destroy/unwind/reverse its C stack call frame. MS's PDB file format has static C struct analogs of those i386 stack destruction C89 functions for IDE C debugger purposes.

By MS removing that dangerous/unstable design, and replacing it with the SysV solution for Win64 on x64/ARM64, is a massive improvement for mix-n-matching random .dll'es from random authors made with random versions of MSVC and WinGCC and Borland. Static unwind tables look like CSS language or make language, not a Turing complete language like Perl/C/Java/Python. Any usermode or windows kernel mode library can effortless O(1)/O(n) inspect the C stack all the way back to "OS thread start call frame", just like a C debugger can, if it really needs to for some very unusual reason. GCC DWARF files and .pdbs are "wishlist" on Win64, not "punchlist" for a dev using a C debugger, after MS copy pasted SysV' spec's design for Win on AMD64. Ancient WinNT for SPARC/MIPS/Alpha/PowerPC, did not have any "Microsoft" ABI. The ancient MS Windows SDKs for those 4 CPUs, just included a link/postal address telling developers to read the man pages and do whatever it says in the native commercial Unix OS's man pages covering sys calls and do whatever that commercial Unix OS's man pages for /bin/as says you have to do, MS probably just licensed chunks of the closed source /bin/as and dropped those .a files into the MSVC .exe with very little code.

No production code/production apps/kernel drivers ever actually do this on Windows OS, because its pointless, unless they intentionally want to be C debuggers or produce their own automated "crash" reports without "help" from the Windows Kernel/MS official public APIs for 3rd party C debuggers. One exception to the last sentence, 99% of commercial video game engines will walk/backtrace the C stack at runtime constantly looking for gaming cheats/cracks in address space, and video game engines constantly enumerate or monitor their user mode address space for "hostile" .dlls or "hostile" executable memory pages, inserted remotely into the process by using MS public API C debugger function calls by the person sitting at the keyboard/mouse.

Random topic the last paragraph reminded me of: Chrome/Chromium/OSX probably DIY this algorithm, and I know for sure the Win10 kernel does this since MS says it does. If a Win 10 machine is on wall/mains power, the kernel, when it does a "mark and sweep" of your address space to look for "cold" 4KB pages you haven't dereferenced for a long time, will nowadays constantly GZIP groups of your "cold" 4KB pages from physical RAM into other free physical RAM, and then wait a while longer before sending those GZIPed 4KB malloc pages to your SSD. GBs of phy RAM are very cheap nowadays. Burning out SSDs is not cheap. I throw out my Transcend/Sandisk SSDs every 2 years on my dev/C compiler machines nowadays. Usually my SSDs start to crawl at 5-10MBPS on SATA at the end of their life, or, 1 time ever (learned my lesson), the SSD outright died, losing all data. So yeah, GZIPing 4KB memory pages from phy RAM to phy RAM before sending them to the SSD is very common nowadays.

Linux/Windows GZIP "cold" user mode malloc pages nowadays instead of using the paging file, so which P5P Op Tree Guru wants to implement GZIPing a cold PP sub's optree structs at runtime, to free up phy RAM and virtual paging file private (malloc) bytes at /bin/perl process runtime? I hope everyone knows Perl's optree structs can be stored in HW const read only .dll/.so memory and be executed by the runloop and an unmodified threaded perl engine at runtime.

I've tried it, and got a PP sub stored in a .dll, that just does sub { print "hello"; }, to execute without a SEGV on an unmodified threaded!!!! perl53X.dll. I wrote the PP optree to a .c file, compiled the .dll, then tricked the CV* and the runloop into executing that PP optree stored in a RO .dll, and it works.

But!!!! BUT!!!!! anything more complicated than sub { print "hello"; } or sub { return 5; }, SEGVed perl. Perl's pp_foo() functions do direct pointer equality comparisons of BASEOP field OP* (*op_ppaddr)(pTHX); instead of testing PERL_BITFIELD16 op_type:9;, and there is no way I can const "burn in" C function pointer memory addresses from libperl.dll into a foreign/different .dll, while keeping those C struct/OP tree structs, in that foreign .dll, as C const decl memory.
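To make the pointer-equality problem above concrete, here is a minimal sketch in plain C. Everything in it is hypothetical: `toy_op`, `toy_pp_const` and the thunk are invented stand-ins, not perl's real `struct op` or any `Perl_pp_*` function. It only assumes what the comment describes: an op carries both a function pointer and a small type code, and a foreign shared library ends up with a jump stub whose address differs from the one libperl uses.

```c
#include <stddef.h>

/* Hypothetical miniature of an op node; NOT perl's real struct op. */
typedef struct toy_op toy_op;
typedef toy_op *(*toy_ppaddr)(toy_op *);

enum toy_optype { TOY_OP_CONST = 1, TOY_OP_PRINT = 2 };

struct toy_op {
    toy_op          *op_next;
    toy_ppaddr       op_ppaddr;  /* function that executes this op */
    enum toy_optype  op_type;    /* small integer identifying the op */
};

/* Stands in for the pp function as seen from inside libperl. */
static toy_op *toy_pp_const(toy_op *o) { return o->op_next; }

/* Stands in for the jump stub a foreign .dll/.so would reference:
 * same behaviour, different address. */
static toy_op *toy_pp_const_thunk(toy_op *o) { return toy_pp_const(o); }

/* Fragile test: identity of the function pointer. */
static int is_const_by_addr(const toy_op *o)
{
    return o->op_ppaddr == toy_pp_const;
}

/* Address-independent test: the type code stored in the op. */
static int is_const_by_type(const toy_op *o)
{
    return o->op_type == TOY_OP_CONST;
}
```

An op whose `op_ppaddr` points at the thunk still runs correctly, but `is_const_by_addr()` no longer recognises it, while `is_const_by_type()` does. That is the failure mode being described for const optrees stored outside libperl.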

My C compiler/all C compilers will generate tiny 1 CPU instruction long, C functions that jump from the "current .dll" to libperl.dll, I can't make the C compiler at compile time, scribble in the same C function pointer memory address that libperl.dll uses to scribble in the addresses Perl_pp_foo_whatever() into the optree. All CPAN XS libraries that want to create their own OP structs, will use P5P's official op *(__fastcall *PL_ppaddr[426])(interpreter *) symbol to get the function pointer, not use Foo OS'es .so/.dll/.o/.a linker and its APIs. I can fix the FooOS ld.so/FooCC /bin/ld versus op *(__fastcall *PL_ppaddr[426])(interpreter *) problem on WinOS, while keeping inter-perl-process WinKernel COW memory sharing between Perl's "B::DllByteLoader.pm"'s generated .dll's that store "compiled PP optrees", but I know the code will never be accepted into stable perl's .git repo, and nobody can probably maintain it but me, and 1 person bus factor code like that shouldn't be inside the official P5P interpreter.

I won't pursue this PR further over the gcc -Og problems.

I'm more worried about whether the compiler optimizes based on the type of the resulting SV, ie. does it optimize away a type check for SVt_PVAV if the SV was created with newSV_type(SVt_PVAV)?

I'll try to do some experiments to see, though presumably that could vary between compilers and versions. I'd like to imagine that:

  • If the call is inlined, and nothing happens in between the call and type check, the check would be optimized away. (Compiler will see "SvTYPE(sv) = SVt_whatever", followed by "if (SvTYPE(sv) == SVt_whatever)".)
  • If the call isn't inlined, or there's e.g. some function call between SV creation and type check, the compiler plays it safe.

But with optimising compilers, who knows!
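One way to run that experiment outside of perl is a small stand-alone file like the sketch below. The names (`toy_sv`, `toy_new`, the hooks) are invented for illustration and are not perl API. Compiling it with something like `gcc -O2 -S` and reading the assembly for the two check functions would show whether a given compiler folds the comparison; the result may well differ between compilers and versions, so no particular outcome is claimed here.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; not perl's SV or svtype. */
enum toy_type { TOY_IV = 1, TOY_PV = 2, TOY_AV = 3 };

typedef struct {
    enum toy_type type;
    unsigned      refcnt;
} toy_sv;

/* Constructor in the style of an inline newSV_type(): the type is
 * written with a value the caller normally knows at compile time. */
static inline toy_sv *toy_new(enum toy_type t)
{
    toy_sv *sv = malloc(sizeof *sv);
    if (!sv)
        abort();
    sv->type = t;
    sv->refcnt = 1;
    return sv;
}

/* Case 1: nothing happens between construction and the type check,
 * so an inlining compiler is free to fold the comparison. */
static int check_immediately(void)
{
    toy_sv *sv = toy_new(TOY_AV);
    int is_av = (sv->type == TOY_AV);
    free(sv);
    return is_av;
}

/* Case 2: an opaque call sits between construction and the check.
 * The callee receives sv, so the compiler has to assume the type
 * field may have been rewritten and must reload it. */
static int check_after_call(void (*hook)(toy_sv *))
{
    toy_sv *sv = toy_new(TOY_AV);
    hook(sv);
    int is_av = (sv->type == TOY_AV);
    free(sv);
    return is_av;
}

static void hook_noop(toy_sv *sv)   { (void)sv; }
static void hook_retype(toy_sv *sv) { sv->type = TOY_PV; }
```

`hook_retype` is there to show why the second case cannot be folded in general: the check really can fail after an intervening call.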

Yeah, assuming LTO can legally/safely change the C/C++ official ASCII string name of a C static symbol, as declared by the C developer with his keyboard, is a very big request from multiple different engineering teams. Doing optimizations to C autos backed by the C stack, that never had the & operator used on them, is basic simple safe stuff. Rewriting the inter-function call ABI between 2 separate C function calls, regardless of Linux's visibility feature or ISO C's static feature, is a huge complicated request. Remember, a full complete C lang inline, if the CC actually does the inline, is absolutely identical to using the C preprocessor and doing a copy paste of .c src code. The inline C function symbol simply doesn't exist anymore in machine code in the final binary. But an inline C func that the CC refused to inline has to follow almost all the same rules as if it was an extern decl in the first place. Same as using the C keyword register on autos as a coding error safety tool in modern C code.

c79fe2b (which IIRC had follow-ups) was in response to reasonable feedback received about the naming of new_sv. I'm not sure that simple reversion is going to happen.

Correct, when I say revert, it's not going to be done using git's revert GUI button, but by writing new code to bring back the old design pattern.

Where is the chat log/ticket/ML thread about that commit if there is one? I'd like to see the design/engineering chit chat done from the time of that commit if its available.

I'm not clear that #22667 is the right or best approach. For example, what if I did the following, with clear comments?

  • Move the body_details tables back into sv.c and mark it extern

Pointless as an optimization, since const declared global structs can't be const folded by the CC across .so'es/.dll'es. Some CCs/ABIs/OSes may also have a policy that const declared global data vars CAN NOT always be const folded, b/c https://man7.org/linux/man-pages/man2/mprotect.2.html exists at runtime.

click to expand, this is about C lang/C linker arch, and isn't about Perl's C code

"Const for your .so, not for me (my .so), haha, it's MY API you are consuming bro"

In reality, ISO C, or ELF/PE/.dll/.so global data symbols can be intentionally misdeclared with const in public api headers, but disassembling the producer .so/.dll shows those C global data symbols live in RW .data image section, and producer can modify that global data var at runtime if it wants to, because the secret in-house C headers, didn't declare that variable with the const tag. I've seen this in production grade Win32 DLLs a couple times in my life. The public SDK for 3rd parties says that C data symbol is const, but that is 100% a lie, and I can memcpy() change that global variable at runtime if I know the secret its actually RW memory at runtime.

But I would like to see that struct made an exported symbol from libperl, so CPAN XS authors can inspect the official, authoritative copy of that C struct at runtime, without using a .h copy of it, created when B::Foo and Devel::Foo got downloaded from CPAN and compiled. Its free/no cost to do a OS/C lang export on that read only struct from libperl for unofficial CPAN XS usage. XS_HANDSHAKE reasons, or interp core git blame/smoke reasons, where a B::Foo XS module wants to statically inspect that table, using OS mmap/dl_open(), from 5 different stable perl versions, without creating 5 different perl processes and redirecting STDOUT.

  • Hardcode the relevant values into newSV_type - but on DEBUGGING builds there is an assert statement that each of the hardcoded values matches what's actually in the relevant table.

Just generic design talk by me, I want to be dead sure my final patch, whatever it looks like in the end, truly matches and does the same thing as described in the untouched original bodies_by_type struct. There will be some kind of test code added to XS::APITest by me probably. There is a risk of me or anyone, making a copy paste keyboard error, during the process of turning that "2D matrix" of data, into independent isolated "rows" of C preprocessor "data", with their IDE.

I already tried writing regexes to "parse" that "2D matrix" of data stored in bodies_by_type, in an automated regen.pl way, into CPP macros. I gave up after 15 minutes of prototyping a regexp that could parse bodies_by_type. The reason was that there is no point in writing a final patch without hearing community input first.
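The "hardcode the values, assert against the table on DEBUGGING builds" idea, plus the kind of cross-check test mentioned for XS::APITest, could look roughly like the sketch below. It is a stand-alone model under stated assumptions: `toy_bodies_by_type`, the three types and every size in it are invented, and plain `assert()` (compiled out under `NDEBUG`) stands in for a DEBUGGING-only check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and sizes; not perl's svtype or body sizes. */
enum toy_type { TOY_IV, TOY_PV, TOY_AV, TOY_TYPE_COUNT };

struct toy_body_details { size_t body_size; size_t arena_size; };

/* Authoritative table, standing in for bodies_by_type. */
static const struct toy_body_details toy_bodies_by_type[TOY_TYPE_COUNT] = {
    [TOY_IV] = {  0,    0 },
    [TOY_PV] = { 24, 4080 },
    [TOY_AV] = { 40, 4080 },
};

/* Hand-copied constants: with a compile-time-constant argument the
 * switch collapses to a single literal after inlining. */
static inline size_t toy_hardcoded_size(enum toy_type t)
{
    switch (t) {
    case TOY_IV: return 0;
    case TOY_PV: return 24;
    case TOY_AV: return 40;
    default:     return 0;
    }
}

/* What an allocator would call: the hardcoded value, checked against
 * the table only when assertions are enabled. */
static inline size_t toy_body_size(enum toy_type t)
{
    size_t size = toy_hardcoded_size(t);
    assert(t < TOY_TYPE_COUNT);
    assert(size == toy_bodies_by_type[t].body_size);
    return size;
}

/* Test-suite style sweep: counts rows where the hand-copied value
 * has drifted from the table (a copy/paste error, for example). */
static int toy_count_mismatches(void)
{
    int bad = 0;
    for (int t = 0; t < TOY_TYPE_COUNT; t++)
        if (toy_hardcoded_size((enum toy_type)t)
                != toy_bodies_by_type[t].body_size)
            bad++;
    return bad;
}
```

The sweep function covers every row in one go, which is the part that guards against a typing slip when the table is turned into per-type constants by hand.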

  • Move Perl_new_sv back into sv.c as S_new_sv (as it was), have DEBUG_LEAKING_SCALARS call into that function, inline the non-debugging bits into newSV_type.
    if (PL_sv_root)
        uproot_SV(sv);
    else
        sv = Perl_more_sv(aTHX);
    SvANY(sv) = 0;
    SvREFCNT(sv) = 1;
    SvFLAGS(sv) = 0;

It seems like that could address the main bloat concerns raised by bulk88.

Yeah, ^^^ is 10 or less CPU op codes. Splicing the Arena linked list, takes more op codes than SvANY(sv) = 0; SvREFCNT(sv) = 1; SvFLAGS(sv) = 0; which is literally 1 or 2 SSE/AVX CPU move instructions to do in -O1/-O2.
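For readers who don't have sv.c open, the shape of that fast path can be modelled in a few lines of stand-alone C. This is a sketch, not perl's code: `toy_sv`, `toy_sv_root`, `toy_more_sv` and the slot count are invented, the free list is threaded through the `any` field purely for illustration, and arenas are never released, mirroring the "pools never shrink" behaviour described in this thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical SV head; while a head is unused, 'any' links it into
 * the free list. */
typedef struct toy_sv {
    void     *any;
    unsigned  refcnt;
    unsigned  flags;
} toy_sv;

#define TOY_ARENA_SLOTS 64

static toy_sv   *toy_sv_root;   /* head of the free list */
static unsigned  toy_refills;   /* how often the slow path has run */

/* Slow path: allocate a new arena, chain slots 1..N-1 onto the free
 * list and hand slot 0 straight to the caller. */
static toy_sv *toy_more_sv(void)
{
    toy_sv *arena = malloc(TOY_ARENA_SLOTS * sizeof *arena);
    if (!arena)
        abort();
    toy_refills++;
    for (size_t i = 1; i < TOY_ARENA_SLOTS; i++) {
        arena[i].any = toy_sv_root;
        toy_sv_root = &arena[i];
    }
    return &arena[0];
}

/* Fast path: pop a head off the free list, then three plain stores. */
static inline toy_sv *toy_new_sv(void)
{
    toy_sv *sv;
    if (toy_sv_root) {
        sv = toy_sv_root;
        toy_sv_root = sv->any;
    } else {
        sv = toy_more_sv();
    }
    sv->any = NULL;
    sv->refcnt = 1;
    sv->flags = 0;
    return sv;
}

/* Return a head to the free list; the arena itself is never freed. */
static void toy_del_sv(toy_sv *sv)
{
    sv->any = toy_sv_root;
    toy_sv_root = sv;
}
```

Allocating and releasing in a loop leaves `toy_refills` at 1, which is the "flat lines eventually" behaviour: once the pool has grown to the workload's peak, only the inline fast path runs.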

It's also super super important for all of libperl's public API newSVxxxxv() functions to be refactored, so all the "writes" done with macros in src code, to head and body fields, can get const folded together by the CC, rather than throwing extern C funcs into the newSVxxxxv() function family in random places, in a random order, in those P5P maintained performance critical SV allocators/ctors.

Another very very low priority IMO, thing to do, to the newSVxxxxv() function family, is refactor them, so they alloc/unlink from the arena, the SV BODY BEFORE THE SV HEAD. and fill in the SV body, before they unlink the head and fill in the head. This way, newSVxxxxv() function family, isn't balancing 2-4 incoming argument variables, plus 4-10 C auto variables, at the same time, post liveness analysis in registers/C stack slots, but only balancing a max of 4 (win64) or 6 (linux) C variables at any point, or maybe 8 vars at a time (Android ARM32/64 ???), and therefore doesn't need to CTOR and DTOR formal ABI C stack slots in machine code in prolog/epilog, saving maybe 6-14 CPU opcodes in the process. This refactor in this paragraph is very low on my priority list.

click to expand, this is about C lang/C linker arch, and isn't about Perl's C code

Using registers vs C stack slots has been benchmarked to be 100% identical on x86/x64 for at least 15 years. Athlon 64/Intel Core 2 and newer have 32 or more general purpose registers in real life, not 8/16 registers like the formal ISA.pdf and official intel/amd assembly_reference.pdf say. The front end assembly/mach code decoder instantly and silently assigns/maps up to 8/16 different offset reads/writes to C stack slots automatically to physical CPU registers: no memory address, no write back cache, no L1 cache. After 8 C stack slots, the CPU starts using L1 cache lines, with a larger picosecond latency. Therefore IDC too much about the "fill in the SV body completely before unlinking the SV head and filling it in" optimization. Too many src line changes for too little runtime effect. So very low priority in my book.

In any case x64 SysV/Win64 prolog/epilogs are so "simple" as a byte pattern b/c OS ABI reasons, they are probably a single CPU opcode inside the silicon at runtime anyways. On ARM32/64, entering and leaving a C function calls, and following redzone/sig handler OS rules, on ARM32/64, is ONE SINGLE purpose made, 4 byte long, ARM CPU instruction, not (~3-~10)+(~3-~10) ops like on Intel/AMD. Those were some very smart ppl back in 1985 who did this for ARM platform. Ask C/Asm devs what features THEY want, and what common tasks THEY want to do/C compilers often do. Not design a CPU by asking a mathematician or RF electrical engineer or industrial welding robot arm firmware engineer, first on what features they want to see from a CPU.

A C compiler and C SDK for car's combustion engine's spark plug timings or anti-skid logic, has 0% in common with desktop/mobile/UI/business/financial/entertainment/multimedia C compilers (aka Windows/Linux/OSX SDKs). I have successfully written and ran C code for the chip on a welding robot arm 2x in my life. The welding robot arm was a K-12 kids educational toy sadly, but the SDK/C compiler used was the real one used for the real thing in a car factory. I got a >= 95 on those assignments.

knowing Perl/JS beforehand was super helpful :-)

Currently in blead perl, all the sv_magic()/add_taint()/gv_stash()/gv_fetch()/malloc()/memset() func calls are in a pretty much random order of operations, in between Perl CPP macros writing default values on top of uninit-ed bytes in the fresh head and body structs. Stuff like 4-8 different = C ops doing read/modify/write assignments to U32 sv->sv_flags, to fill in ONE struct member, isn't well written code. For good readability reasons, the C dev didn't constant fold the hex integer literals by hand in source code, but the C dev also wrote the code in such a way that the C compiler can't constant fold all those integer literals either, because 3 separate calls to malloc() are separating the flags = flags | SVf_ANYTHING; statements in the .c file.
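The flags pattern described above can be shown with a toy struct. This is a hedged sketch: the flag bits, `toy_sv` and both init functions are invented, and the allocator is passed as a function pointer so that the calls are opaque to the compiler, loosely mirroring allocation that goes through wrapper functions rather than a call the compiler can see into.

```c
#include <stddef.h>
#include <stdlib.h>

/* Invented flag bits for illustration; not perl's SVf_* values. */
#define TOY_F_POK   0x01u
#define TOY_F_UTF8  0x02u
#define TOY_F_TEMP  0x04u

typedef struct {
    unsigned  flags;
    char     *buf;
    char     *aux;
} toy_sv;

typedef void *(*toy_alloc_fn)(size_t);

/* Interleaved: each |= is separated from the next by an opaque call.
 * sv is reachable by the caller, so after every call the compiler has
 * to assume sv->flags may have changed and redo the load/or/store. */
static void toy_init_interleaved(toy_sv *sv, size_t len, toy_alloc_fn alloc)
{
    sv->flags = 0;
    sv->flags |= TOY_F_POK;
    sv->buf = alloc(len);
    sv->flags |= TOY_F_UTF8;
    sv->aux = alloc(len);
    sv->flags |= TOY_F_TEMP;
}

/* Grouped: all calls first, then the plain stores. The three literals
 * are still spelled out separately for readability, but now they fold
 * into one constant and one store. */
static void toy_init_folded(toy_sv *sv, size_t len, toy_alloc_fn alloc)
{
    char *buf = alloc(len);
    char *aux = alloc(len);
    sv->buf = buf;
    sv->aux = aux;
    sv->flags = TOY_F_POK | TOY_F_UTF8 | TOY_F_TEMP;
}
```

Both functions leave the struct in the same state; the difference is only in how much freedom the compiler has to merge the writes.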

I've recently tried out a new design algo on libperl.dll that reduces Perl_more_sv(aTHX) style branches, to exactly 1 CPU opcode on Win32/Win64 with tailcall weirdness and Perl C API "specialness", and ELF/Win64 redzone specialness (google leaf function). and symbol S_new_sv() becomes approx 4-5 CPU instructions big on AMD64.

But there is a severe runtime cost to my optimization, if Perl_more_sv() executes more than 3-5x in a perl processes lifetime. But since Perl's arena pools never ever shrink and are never ever GCed, Perl_more_sv() flat lines eventually and will never ever be called again. And yes, a perl apache webserver process, will not execute Perl_more_sv(), for weeks/months/years at a time if my logic is correct.

I'm going to make a PR with a committable src code example of the above ^^^ optimization somewhere else in the perl VM tonight probably.

I didn't recognize this defect for 15 years until a few months ago, but both Reini's B::CC.pm and Richard independently came to the same conclusion: Perl 5 VM's public/private C API has a less than perfect, or slightly flawed, balance between what the C API does locally in machine code with macros/static inlines and what is offloaded to heavyweight full-featured funcs in libperl.so/.dll. This balance can be slowly tweaked by P5P over time without affecting CPAN XS modules, since P5's C API has been 100% macros and function calls since 5.000 alpha. P5's public C API, intelligently, from day 1, has never ever publicly allowed 3rd party raw direct C struct member/field manipulation. 3rd party people == CPAN XS authors.

I never recognized Reini's and Richard's opinions myself, since WinPerl's malloc()/file system/ring 0 kernel calling layers are always what comes out "on top" in every instrumenting/profiling tool I've used on blead perl personally. And Perl's millisecond TTFB response time isn't something that affects my day job's private business Perl code. They use Perl Tk/Win32::GUI for local apps, and some ETL operations that are many seconds, or 3-5 minutes, long. JSON/protobuf/YAML/Storable serializers aren't used at all. A little bit of data lives in CSV files. Almost all data is accessed through Perl XSUBs that bind against closed source C .dlls. TIESCALAR and JSON/Sereal/Storable just aren't tools I personally use or have ever benchmarked.

PS macros dXSTARG/SvSETMAGIC() really really need to learn about their new brothers/sisters called overload.pm/SvAMAGIC()/SVt_PVLV, and learn how to directly write into a SVt_PVLV lvalue object on a C/ABI level. I reverse engineered how to do that from a normal CPAN XS XSUB environment back in 2013, but I have probably lost the XS code showing exactly how I did it by now. It was an attempt to use overload.pm to deal with an unresponsive/abandoned/difficult maintainer of another CPAN module. I soon realized overload.pm, as a tool to deal with a very old problematic CPAN module, is the wrong tool. The right tool is to just privately fork that CPAN module and s/CPAN::Name/Local::CPAN::Name/g; it.

click to expand, the C code of Perl 1-Perl 4 and what Perl 1-4 interps did in C src is TLDR

Perl 5's C API is much better than other C APIs. Perl 5 VM's API is not MS's Win32 API that can't edit its C structs ever again for any reason for the next 100 years, or POSIX's API, that only allows resorting the memory order/.h src code order, of C struct fields, but the ascii identifiers or their functionality are frozen forever for the next 1000 years. POSIX has no concept of fast/slow coding concepts, or plain-old-data dereferences vs high feature, heavy weight, "data properties" with dynamic getter/setter methods.

These choices were probably made unintentionally by Larry when he typed in Perl 1-Perl 3's C code. Larry probably read the C code for /bin/bash or /bin/csh for inspiration when he typed Perl 1-4's C code. He definitely didn't read the C src code of /bin/cc and /bin/as for inspiration. Perl 5's C code and architecture looks nothing like the modern internals of GCC/Clang/.NET VM/FF/Safari/Chrome VMs.

But someone did retrofit Perl 5 with a SSA/RTL layer that lives in https://github.com/Perl/perl5/blob/blead/op.c which is peculiar, since the logic that lives in https://github.com/Perl/perl5/blob/blead/op.c really should be part of https://github.com/Perl/perl5/blob/blead/perly.h https://github.com/Perl/perl5/blob/blead/perly.c https://github.com/Perl/perl5/blob/blead/perly.act https://github.com/Perl/perl5/blob/blead/perly.tab and machine generated, not hand written optimizing phases.

But regardless, Perl >=5.004 ??? has a distinct parse/compile/SSA/RTL/write final machine code (OP* structs) block diagram, which is how most/all other languages work, except for line-by-line languages like basic/bash/csh/ksh and certain SW programming languages created in the 1950s and early 1960s that don't have any compile or parse phase.

click to expand, off topic comp sci talk SO style

The ASCII as typed executes on silicon HW as if it's machine code. Not a joke, in my private life, I do interact with another entity's AJAX server, that ultimately is a front end for this Github SW repo and its ancient programming language :-D

https://github.com/WorldVistA/VistA/blob/master/Packages/Auto%20Replenishment%20Ward%20Stock/Patches/PSGW_2.3_18/PSGW-2P3_SEQ-18_PAT-18.KIDS

In any case, white board talk about further optimizations after fixing the "all 17 types static inline newSV_type() isn't getting CC split into 17 x 1 type tiny newSV_type() variants" bug, is better left after bug <<<<<----- gets fixed once and for eternity in blead and the final "17 -> 17 x 1 funcs" fix whatever it looks like is declared "bug free"/stable after a couple weeks/months by the Perl community.

@bulk88
Contributor

bulk88 commented Apr 12, 2025

  • Hardcode the relevant values into newSV_type - but on DEBUGGING builds there is an assert statement that each of the hardcoded values matches what's actually in the relevant table.
  • Move Perl_new_sv back into sv.c as S_new_sv (as it was), have DEBUG_LEAKING_SCALARS call into that function, inline the non-debugging bits into newSV_type.
    if (PL_sv_root)
        uproot_SV(sv);
    else
        sv = Perl_more_sv(aTHX);
    SvANY(sv) = 0;
    SvREFCNT(sv) = 1;
    SvFLAGS(sv) = 0;

It seems like that could address the main bloat concerns raised by bulk88.

I'm currently working on a fix for https://github.com/Perl/perl5/commit/c79fe2b42ae2a540552f87251aa0e36a060dd584, which (IIRC it had follow-ups) was in response to reasonable feedback received about the naming of new_sv. I'm not sure that simple reversion is going to happen. The final patch to fix/clean up the "instrument/log 3 things" code, DEBUG_LEAKING_SCALARS and especially PERL_MEM_LOG, which are 2 separate build config options that are unrelated to each other, will not be trivial. Neither is currently compilable on WinPerl, and IDK how LinPerl with -DPERL_MEM_LOG can even compile right now in stable perl and blead perl, if LinPerl since 5.36 uses the Win32 style GCC visibility feature.

@Leont
Contributor

Leont commented Apr 12, 2025

The original concept/whiteboard of splitting the mega "all 17 types" SV allocator into 17 tiny separate allocators is an excellent optimization idea IMO. The optimization just needs to be implemented, src code wise, in a different way than it's implemented currently in stable/blead perl.

Yeah I was thinking along the same lines. A newSV_PV() would be trivial to inline.

@richardleach
Contributor Author

I had thought about per-type functions originally. Can't remember if someone dissuaded me or I just figured that so many new functions would be unwelcome.

If that's an approach that people actually are happy with and bulk88 is already working on it, am happy to leave him to it.

I did have a PR nearly ready to either:

  • Make Perl_more_bodies take just the type as a single argument, and giving it the logic to figure out the necessary body and arena sizes,
    or
  • Add a Perl_more_bodies_lookup function in sv.c for use by newSV_type - or successors - that looks up the sizes and then calls Perl_more_bodies.

Should I continue working on that?

@bulk88
Contributor

bulk88 commented Apr 15, 2025

I had thought about per-type functions originally. Can't remember if someone dissuaded me or I just figured that so many new functions would be unwelcome.

If that's an approach that people actually are happy with and bulk88 is already working on it, am happy to leave him to it.

My code is getting rid of that __FILE__ __LINE__ __FUNCTION__ assert()-on bug in stable/no -DDEBUGGING perl builds, which is deeply intertwined with PERL_MEM_LOG/DEBUG_LEAKING_SCALARS and 3-6 different my_getenv() wrapper APIs/layers of C that do the exact same thing, with all 3-6 layers stacked on each other at runtime. This would be my 2nd or 3rd attempt in 6 months at cleaning up my_getenv() and all its imposters in p5p/.git. Somewhere in there, stopping no-threads miniperl.exe from doing locale serialization/getenv() serialization is part of it. And adding CC time/CPP macro time const Wide/UTF16 C strings to WinPerl + PerlEnv_getenvpvs() (or similarly named) was my original goal/task/problem 6-9 months ago.

WinPerl's anything and everything to do with %ENV in C or PP, any and all APIs and interp .c files, including pseudo-fork's/CPHost's layer, has severe code rot from the Win95 era. Enumerating the WinPerl OS level malloc() "free" areas in a paused runloop shows that %ENV in C and/or PP, libperl or CPAN XS or PP .pms, is the number 1 "former consumer" of malloc() bytes/blocks at all times (anywhere in the runloop) in a WinPerl process. At least peak >>= 2; if not peak >>= 4; mem reduction is possible from the current case.

Kernel32.dll's GetEnvironmentVariableA in Win7 is a lot less efficient than it was in WinXP/2003 or older. But Kernel32.dll's GetEnvironmentVariableW/RtlQueryEnvironmentVariable are hyper overkill optimized by MS and are not a problem if called directly, just speak Wide. RtlQueryEnvironmentVariable uses every last HW and comp sci trick for speed. They don't compute a hash number, too slow.

I did have a PR nearly ready to either:

  • Make Perl_more_bodies take just the type as a single argument, and giving it the logic to figure out the necessary body and arena sizes,

Yes if u have the code to slim down
void * Perl_more_bodies(pTHX_ const svtype sv_type, const size_t body_size, const size_t arena_size); to void * Perl_more_bodies(pTHX_ const svtype sv_type) please PR / push it, ive never tried to touch that particular fn in a private draft/WIP. Perl_more_bodies is low priority for me in my head since its at the "end" after "all" sv.c/sv.h refactor PRs are merged to blead (if they are).

Its never going away by name or concept IMO. But I have some ideas how void * Perl_more_bodies(pTHX_ const svtype sv_type) could get "removed" or optimized away but they are too far thinking. Getting rid of arg 2 and arg 3 is someones 1st step.
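As a rough picture of that first step (dropping args 2 and 3), here is a stand-alone model. All names and sizes are hypothetical (`toy_details`, `toy_more_bodies3`, `toy_more_bodies`); it only illustrates moving the size lookup from every call site into the one function that lives in the library, and it assumes, as in this thread, that the refill is only called when the per-type free list is empty.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types and sizes; not perl's real body details. */
enum toy_type { TOY_PV, TOY_AV, TOY_HV, TOY_TYPE_COUNT };

struct toy_body_details { size_t body_size; size_t arena_size; };

static const struct toy_body_details toy_details[TOY_TYPE_COUNT] = {
    [TOY_PV] = {  32, 1024 },
    [TOY_AV] = {  64, 1024 },
    [TOY_HV] = { 128, 2048 },
};

/* Per-type free lists; each free body's first word links to the next. */
static void *toy_body_roots[TOY_TYPE_COUNT];

/* Old shape: every caller passes the sizes, so every call site has to
 * materialise three arguments. */
static void *toy_more_bodies3(enum toy_type t, size_t body_size,
                              size_t arena_size)
{
    char  *arena = malloc(arena_size);
    size_t n = arena_size / body_size;
    if (!arena)
        abort();
    for (size_t i = 0; i + 1 < n; i++)
        *(void **)(arena + i * body_size) = arena + (i + 1) * body_size;
    *(void **)(arena + (n - 1) * body_size) = NULL;
    toy_body_roots[t] = arena;
    return arena;
}

/* New shape: one argument; the table lookup happens once, inside the
 * library, instead of at every inlined call site. */
static void *toy_more_bodies(enum toy_type t)
{
    const struct toy_body_details *d = &toy_details[t];
    return toy_more_bodies3(t, d->body_size, d->arena_size);
}

/* Helper for inspection: length of a type's free list. */
static size_t toy_count_free(enum toy_type t)
{
    size_t n = 0;
    for (void *p = toy_body_roots[t]; p; p = *(void **)p)
        n++;
    return n;
}
```

With the one-argument form, a caller only needs the type constant in a register, which is the call-site size reduction being asked for.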

Most/all CPAN XS .dlls on blead import the Perl_more_bodies symbol, so it def needs to be as small as possible at call sites. Perl_more_bodies isnt "hot". It asymptotes eventually the longer a perl process lives and will never be called again.

or

  • Add a Perl_more_bodies_lookup function in sv.c for use by newSV_type - or successors - that looks up the sizes and then calls Perl_more_bodies.

No Perl_more_bodies() is used by various STATIC_INLINEs in perl headers, and therefore all cpan XS .so/.dlls import it, even tho its upto the CC if it wants to emit a small static C func, 1 per .so/.dll or inline the machine code. Its not very hot, and the table logic and Newx() and linked list building needs to be inside libperl, not 100s of copies over 100s of shlibs. Perl's arena pools are 100% Private API and don't exist for CPAN XS's viewpoint (and can be disabled as a exotic debugging build flag and Newx() is used instead). Abs byte size of perl's any Cstructs/body structs isnt "public api", nor CPU paging size, or how many slots/body structs in a row are obtained from Newx(), or is PerlMemShared_malloc() today? how about PerlMemParse_malloc()? The bookie is open and taking wagers for build configs!!!

Perl_more_bodies()/Perl_more_sv() both peak-off and are never called again in most perl workloads. perl arenas can't shrink or get GCed. But on an academic white board, Perl_more_bodies()/Perl_more_sv(), have 100s and 100s of callers (just below 1000) in libperl if truly 100% inlined. They need to be as microscopic once inlined into their callers, just like Perl_newSViv() is tiny at its call sites (3 void*s, iv, myperl, fn ptr).

Only bodyless RV IV and NVs and *AV CTORs on certain macros/code paths, and maybe limited to PERL_CORE, are the only clean inline ctors in my opinion and MSVC -O1's opinion. newSV_type(SVt_PVAV) is the only "with body" type in blead that isnt a disaster zone on the inside post CC optimization and the only "with body" type ctor that is not a forest of extern C function calls mixed with macros/ptr derefs that can't fold.

Should I continue working on that?

yes, your ideas rn arent anywhere close code wide to what im working on, my side non finished draft if someone is bored is attached
getenv_mem_log_refactor.patch

@bulk88
Contributor

bulk88 commented Apr 15, 2025

I had thought about per-type functions originally. Can't remember if someone dissuaded me or I just figured that so many new functions would be unwelcome.

statics can't be found by ASCII string name with a text editor after compiling. 17 exported, nearly identically named functions would be annoying to look at in an IDE/nm/dev tool constantly. Most of the SV types are obsolete Perl 4 holdovers, like FORM and IO objects; SVt_INVLIST and SVt_PVOBJ have questionable technical rationale. Two alternatives: an array of 17 fn ptrs, all of them normally branch-less/fn-call-less, or 1 dispatcher func exported from libperl, with the current symbol name Perl_newSV_type(), with a large switch statement for all 17 branchless ctors. The 17 branchless ctors are statics inside sv.c, not static inlines. Those 2 are better solutions IMO than the 17 new exported symbols approach.
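The two alternatives named above (a table of function pointers, or one exported dispatcher with a switch over file-local constructors) can be sketched as follows. This is a toy with three invented types instead of 17, and the constructors use malloc/calloc instead of arenas, so they are not branchless the way the real ones are meant to be; only the dispatch structure is the point.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types; not perl's svtype. */
enum toy_type { TOY_IV, TOY_PV, TOY_AV, TOY_TYPE_COUNT };

typedef struct {
    enum toy_type type;
    unsigned      refcnt;
    void         *body;
} toy_sv;

static toy_sv *toy_head(enum toy_type t)
{
    toy_sv *sv = malloc(sizeof *sv);
    if (!sv)
        abort();
    sv->type = t;
    sv->refcnt = 1;
    sv->body = NULL;
    return sv;
}

static toy_sv *toy_with_body(toy_sv *sv, size_t size)
{
    sv->body = calloc(1, size);
    if (!sv->body)
        abort();
    return sv;
}

/* File-local per-type constructors: no new exported symbols. */
static toy_sv *toy_new_iv(void) { return toy_head(TOY_IV); }  /* bodyless */
static toy_sv *toy_new_pv(void) { return toy_with_body(toy_head(TOY_PV), 32); }
static toy_sv *toy_new_av(void) { return toy_with_body(toy_head(TOY_AV), 64); }

/* Alternative 1: a table of constructor pointers indexed by type. */
typedef toy_sv *(*toy_ctor)(void);
static const toy_ctor toy_ctor_table[TOY_TYPE_COUNT] = {
    [TOY_IV] = toy_new_iv,
    [TOY_PV] = toy_new_pv,
    [TOY_AV] = toy_new_av,
};

/* Alternative 2: one exported dispatcher with a switch. */
toy_sv *toy_new_type(enum toy_type t)
{
    switch (t) {
    case TOY_IV: return toy_new_iv();
    case TOY_PV: return toy_new_pv();
    case TOY_AV: return toy_new_av();
    default:     return NULL;
    }
}

void toy_free(toy_sv *sv)
{
    if (sv) {
        free(sv->body);
        free(sv);
    }
}
```

Either way only one name is visible outside the file, which addresses the "17 nearly identical symbols in nm" complaint; the trade-off against static inlines is that callers always pay for a real call.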

I prefer the 17 static inlines concept more, since it fully matches the original whiteboard problem, which was "allocate 1000 or 10,000 new SVIVs into a SV** array ASAP" inside some CPAN XS author's disk/wire de-serializer module. Optimizing Perl_newSV_type() is useless for PP/optree code. They have their pre-made PADVAR arrays, and recursion just makes another mirror image of the same array, and the unused depth/recursion levels stay around for the rest of the proc lifetime AFAIK. The optree has special awareness of when it can and can't do a PADSWAP/STEAL; CPAN XSUBs will never have that knowledge. P5P still doesn't let them use 255 COW, and SvLEN()=0/SVf_STATIC/SvPVHEK are almost ungoogleable. This optimization is for CPAN XS authors more than for PP code/optree/the interp itself.

About SVt_INVLIST and SVt_PVOBJ: they probably could be the same type, I think. They are both inaccessible/invisible from PP state, not \$ref moveable, can't be PP cast to a string/scalar context at runtime, and aren't "aggregates of [refs to] scalars". Those 2 (amateur opinion) only have dedicated type codes so they fail a </> LT GT test with everything else RC counted in core and on CPAN, since they are non-PP grammar, opaque C data. Therefore by design they have to fail as badly as my decompiling/AST walking attempt below.

C:\sources\perl5>perl -MEncode -E" say \&Encode::decode;"
CODE(0x16f888)
C:\sources\perl5>perl -MEncode -E" say keys %{\&Encode::decode};"
Not a HASH reference at -e line 1.
C:\sources\perl5>

B::Concise is a module and not part of the Perl 5 grammar.

@tonycoz
Contributor

tonycoz commented Apr 16, 2025

Most of the SV types are obsolete Perl 4 holdovers, like FORM and IO objects

IO is very much in active use:

$ perl -MDevel::Peek -le 'open my $fh, "<", "perl.h" or die; Dump(*$fh{IO})'
SV = IV(0x55db717bb3f0) at 0x55db717bb400
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x55db7178e4d0
  SV = PVIO(0x55db717bc700) at 0x55db7178e4d0
    REFCNT = 2
    FLAGS = (OBJECT)
    STASH = 0x55db717ba680      "IO::File"
    IFP = 0x55db717ad260
    OFP = 0x0
    DIRP = 0x0
    LINES = 0
    PAGE = 0
    PAGE_LEN = 60
    LINES_LEFT = 0
    TOP_GV = 0x0
    FMT_GV = 0x0
    BOTTOM_GV = 0x0
    TYPE = '<'
    FLAGS = 0x0
$ perl -MDevel::Peek -le 'opendir my $fh, "." or die; Dump(*$fh{IO})'
SV = IV(0x55cb1408a820) at 0x55cb1408a830
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x55cb140394d0
  SV = PVIO(0x55cb14067700) at 0x55cb140394d0
    REFCNT = 2
    FLAGS = (OBJECT)
    STASH = 0x55cb14065470      "IO::File"
    IFP = 0x0
    OFP = 0x0
    DIRP = 0x55cb140bd980
    LINES = 0
    PAGE = 0
    PAGE_LEN = 60
    LINES_LEFT = 0
    TOP_GV = 0x0
    FMT_GV = 0x0
    BOTTOM_GV = 0x0
    TYPE = '\0'
    FLAGS = 0x0

I have used formats, but it's been quite a long time now.

Labels
do not merge Don't merge this PR, at least for now

5 participants