Pass opline as argument to opcode handlers in CALL VM #17952

Draft · wants to merge 34 commits into master

Conversation

@arnaud-lb (Member) commented Feb 28, 2025

Related:

This extracts the part of #17849 that passes opline as an opcode handler argument in the Call VM. This should reduce the size of #17849, and also unify the Hybrid and Call VMs slightly.

Currently we have two VMs:

  • Hybrid: Used when compiling with GCC, execute_data and opline are global register variables
  • Call: Used when compiling with anything else; execute_data is passed as an opcode handler argument, and opline is loaded from and stored to execute_data->opline.

The Call VM looks like this:

while (1) {
    ret = execute_data->opline->handler(execute_data);
    if (UNEXPECTED(ret != 0)) {
        if (ret > 0) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
            execute_data = EG(current_execute_data);
        } else {       // returned by ZEND_VM_RETURN()
            return;
        }
    }
}

// example op handler
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
    // load opline
    const zend_op *opline = execute_data->opline;

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    execute_data->opline++;
    return 0; // ZEND_VM_CONTINUE()
}

Opcode handlers return a positive value to signal that the loop must load a new execute_data from EG(current_execute_data), typically when entering or leaving a function.

Changes in this PR

  • Pass opline as opcode handler argument
  • Return opline from opcode handlers
  • ZEND_VM_ENTER / ZEND_VM_LEAVE return opline | (1<<63) to signal that execute_data must be reloaded from EG(current_execute_data)

This gives us:

while (1) {
    opline = opline->handler(execute_data, opline);
    if (UNEXPECTED((intptr_t) opline <= 0)) {
        if (opline != 0) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
            opline = (const zend_op *)((uintptr_t) opline & ~(1ULL << 63));
            execute_data = EG(current_execute_data);
        } else {           // returned by ZEND_VM_RETURN()
            return;
        }
    }
}

// example op handler
const zend_op * ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
    // opline already loaded

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    return ++opline;
}

In addition to making the changes of #17849 smaller and unifying the VMs slightly, this improves the performance of the Call VM:

bench.php is 23% faster on Linux / x86_64:

; hyperfine -L php no-opline,opline-arg '/tmp/{php}/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php'
Benchmark 1: /tmp/no-opline/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php
  Time (mean ± σ):     549.4 ms ±   4.7 ms    [User: 536.3 ms, System: 11.8 ms]
  Range (min … max):   542.2 ms … 558.6 ms    10 runs
 
Benchmark 2: /tmp/opline-arg/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php
  Time (mean ± σ):     445.5 ms ±   2.7 ms    [User: 432.5 ms, System: 12.0 ms]
  Range (min … max):   441.3 ms … 449.0 ms    10 runs
 
Summary
  /tmp/opline-arg/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php ran
    1.23 ± 0.01 times faster than /tmp/no-opline/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php

And 18% faster on macOS / M1.

Symfony Demo is 2.8% faster:

base:        mean:  0.5373;  stddev:  0.0004;  diff:  -0.00%
opline-arg:  mean:  0.5222;  stddev:  0.0008;  diff:  -2.82%

JIT

When using the Hybrid VM, the JIT stores execute_data/opline in two fixed callee-saved registers and rarely touches EX(opline), just like the VM.

Since the registers are callee-saved, the JIT'ed code doesn't have to save them before calling other functions, and can assume they always contain execute_data/opline. The code also avoids saving/restoring them in prologue/epilogue, as execute_ex takes care of that (JIT'ed code is called exclusively from there).

When using the Call VM, we can do that too, except that we can't rely on execute_ex to save the registers for us, as it may use these registers itself. So we have to save/restore the two registers in the JIT'ed code prologue/epilogue.

TODO

  • Test x86
  • Test aarch64
  • Test Windows

@dstogov (Member) left a comment:

This is really interesting. This will work with MSVC as well, right?

On bench.php callgrind shows significant improvement without JIT, and slight improvement with tracing JIT.

As I understood, the main improvement comes from the elimination of EX(opline) loading in each instruction handler (elimination of USE_OPLINE) and from improved prologues and epilogues of short handlers. Right? Maybe I missed something?

I'm really surprised by the effect. I understand the improvement coming from caching EX(opline) in a preserved CPU register, but in my opinion, passing and returning it across all handlers should smooth out the effect of the load elimination. It seems I was wrong.

I think, this should be finalized, carefully reviewed and merged.

Comment on lines 373 to 379
-# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data
-# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data
-# define ZEND_OPCODE_HANDLER_ARGS_DC , ZEND_OPCODE_HANDLER_ARGS
-# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC , ZEND_OPCODE_HANDLER_ARGS_PASSTHRU
+# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data, const zend_op *opline
+# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data, opline
+# define ZEND_OPCODE_HANDLER_ARGS_DC ZEND_OPCODE_HANDLER_ARGS,
+# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC ZEND_OPCODE_HANDLER_ARGS_PASSTHRU,

@dstogov (Member):

Originally PHP used _D, _C, _DC and _CC suffixes for macros that implement implementation-defined arguments. I can't remember what they mean. Probably C = comma and D = declaration.
@derickr do you remember?

Probably, we shouldn't use the same convention for something else.

@arnaud-lb (Member, Author):

I agree. I'm not sure this change (commit 7d7be6d) makes a difference anymore, so I may revert it.

@bwoebi (Member) commented Mar 14, 2025:

Yeah: call, comma and declaration, as far as I remember. CC = call comma; DC = declaration comma.
So this still fits.
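
For illustration, here is how the suffixes read in use; the helper below is hypothetical, not an actual php-src function:

/* DC = "declaration comma": appends the extra parameters to a helper's
 * own parameter list. CC = "call comma": appends the matching arguments
 * at the call site. */
static void zend_fetch_helper(uint32_t var ZEND_OPCODE_HANDLER_ARGS_DC);

/* Call site inside an opcode handler: */
zend_fetch_helper(opline->op1.var ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC);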

Comment on lines 440 to 441
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1))
# define ZEND_VM_ENTER_EX() return (zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT)

@dstogov (Member):

The assumption that the high bit of an address must be zero may be wrong.

@arnaud-lb (Member, Author):

I think it's right at least on x86_64, but I will check other platforms. I hope I can keep this scheme as it allows this check to be a single instruction:

out($f, $m[1]."if (UNEXPECTED((intptr_t)OPLINE <= 0))".$m[3]."\n");

@arnaud-lb (Member, Author) commented Mar 14, 2025:

I've added support for systems where the high bit may be set in user space addresses: 41335c4

On these systems I assume that alignment is > 1 (we make this assumption in a few places already), and I define ZEND_VM_ENTER_BIT to 1.

ZEND_VM_RETURN() becomes return ZEND_VM_ENTER_BIT instead of return 0, and ZEND_VM_LEAVE() / ZEND_VM_ENTER() remain return opline | ZEND_VM_ENTER_BIT.

The check

if (UNEXPECTED((intptr_t)opline <= 0)) {

becomes

if (UNEXPECTED(((uintptr_t)opline & ZEND_VM_ENTER_BIT))) {

which compiles very similarly:

test   $0x1,%al
je     0xa5f740

Instead of

test   %rax,%rax
jg     0xa5f6f0

Both have the same encoded size.

The alignment variant is very slightly faster according to both valgrind and wall time. I think that's because the constant 1ULL<<63 results in slightly bigger code in other places.

So I may keep only the alignment variant, as it's more portable. WDYT?
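
For reference, a condensed sketch of the alignment variant described above (assuming zend_op pointers are at least 2-byte aligned; the macro bodies are reconstructed from the snippets in this thread, not copied from the patch):

# define ZEND_VM_ENTER_BIT 1ULL
/* ZEND_VM_CONTINUE(): plain opline, bit 0 clear */
# define ZEND_VM_CONTINUE()  return opline
/* ZEND_VM_ENTER() / ZEND_VM_LEAVE(): opline with bit 0 set */
# define ZEND_VM_ENTER_EX()  return (const zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT)
/* ZEND_VM_RETURN(): NULL with bit 0 set */
# define ZEND_VM_RETURN()    return (const zend_op*)ZEND_VM_ENTER_BIT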

#ifdef ZEND_VM_FP_GLOBAL_REG
execute_data = vm_stack_data.orig_execute_data;
# ifdef ZEND_VM_IP_GLOBAL_REG
opline = vm_stack_data.orig_opline;
# endif
return;
#else
if (EXPECTED(ret > 0)) {
if (EXPECTED(opline != NULL && (uintptr_t)opline != ZEND_VM_ENTER_BIT)) {

@dstogov (Member):

This check must be more expensive...
The second part looks incorrect. I suppose it should check only a single bit.

@arnaud-lb (Member, Author) commented Mar 14, 2025:

We can have opline == ZEND_VM_ENTER_BIT because the JIT will return opline | ZEND_VM_ENTER_BIT unconditionally to force the VM to reload execute_data, even when opline might be 0.

I've updated this to

opline = (const zend_op*)((uintptr_t)opline & ~ZEND_VM_ENTER_BIT);
if (EXPECTED(opline != NULL)) {

which is similar, but clearer. This adds just one instruction compared to the base branch.

Comment on lines -3047 to +3077
static int ZEND_FASTCALL zend_runtime_jit(void)
static ZEND_OPCODE_HANDLER_RET ZEND_FASTCALL zend_runtime_jit(ZEND_OPCODE_HANDLER_ARGS)

@dstogov (Member):

Oops. It looks like the previous prototype was wrong.

#if GCC_GLOBAL_REGS
return; // ZEND_VM_CONTINUE
#else
return op_array->opcodes; // ZEND_VM_CONTINUE

@dstogov (Member):

We may skip leading RECV instructions. Actually, you should return the opline passed as an argument, or EX(opline).

Comment on lines -1146 to -1149
#ifndef HAVE_GCC_GLOBAL_REGS
opline = EX(opline);
#endif

@dstogov (Member):

Will this work with other VM types (GOTO and SWITCH)?

@arnaud-lb (Member, Author):

I will check!

@arnaud-lb (Member, Author):

JIT doesn't support the GOTO and SWITCH VMs currently:

zend_error(E_WARNING, "JIT is compatible only with CALL and HYBRID VM. JIT disabled.");

I'm assuming it's because zend_get_opcode_handler_func() cannot be implemented, as there are no opcode handler functions:

ZEND_API const void* ZEND_FASTCALL zend_get_opcode_handler_func(const zend_op *op)

We could support these VM kinds in the future by generating handler functions in addition to the big SWITCH/GOTO. If these functions used the same conventions as the CALL VM, the changes in this PR would work with these VM kinds too (see the sketch below). WDYT?
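
As a sketch (hypothetical; no such code exists in this PR), a generated function could follow the CALL VM conventions like this:

/* Hypothetical: a standalone handler generated alongside the big
 * switch/goto, duplicating the code of the corresponding case/label,
 * so that zend_get_opcode_handler_func() has something to return. */
static const zend_op *ZEND_NOP_SPEC_HANDLER(zend_execute_data *execute_data, const zend_op *opline)
{
    /* instruction execution (empty for NOP) */
    return ++opline; /* ZEND_VM_NEXT_OPCODE() / ZEND_VM_CONTINUE() */
}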

@dstogov (Member):

Oh, I didn't mean supporting JIT for SWITCH/GOTO. Just the SWITCH/GOTO VM interpreter.

@arnaud-lb (Member, Author):

> This is really interesting. This will work with MSVC as well, right?

I assume so, but I still need to test on Windows/MSVC, x86, aarch64.

> On bench.php callgrind shows significant improvement without JIT, and slight improvement with tracing JIT.

> As I understood, the main improvement comes from the elimination of EX(opline) loading in each instruction handler (elimination of USE_OPLINE) and from improved prologues and epilogues of short handlers. Right? Maybe I missed something?

> I'm really surprised by the effect. I understand the improvement coming from caching EX(opline) in a preserved CPU register, but in my opinion, passing and returning it across all handlers should smooth out the effect of the load elimination. It seems I was wrong.

Yes, it is my intuition that eliminating the load/stores of EX(opline) is what yields the improvements.

A comparison of the annotated code of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER (the hottest handler in bench.php) between branches shows that the current branch has fewer instructions and fewer memory accesses:

Base branch:

Dump of assembler code for function ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER:
   0x0000000000aa5a10 <+0>:	mov    (%rdi),%rax          // opline = EX(opline)
   0x0000000000aa5a13 <+3>:	movslq 0x8(%rax),%rcx
   0x0000000000aa5a17 <+7>:	movslq 0xc(%rax),%rdx
   0x0000000000aa5a1b <+11>:	movslq 0x10(%rax),%rsi
   0x0000000000aa5a1f <+15>:	movsd  (%rdi,%rcx,1),%xmm0
   0x0000000000aa5a24 <+20>:	addsd  (%rdi,%rdx,1),%xmm0
   0x0000000000aa5a29 <+25>:	movsd  %xmm0,(%rdi,%rsi,1)
   0x0000000000aa5a2e <+30>:	movl   $0x5,0x8(%rdi,%rsi,1)
   0x0000000000aa5a36 <+38>:	add    $0x20,%rax          // opline++
   0x0000000000aa5a3a <+42>:	mov    %rax,(%rdi)         // EX(opline) = opline
   0x0000000000aa5a3d <+45>:	xor    %eax,%eax           // ret = 0
   0x0000000000aa5a3f <+47>:	ret

Current branch:

Dump of assembler code for function ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER:
   0x0000000000aa68a0 <+0>:	movslq 0x8(%rsi),%rax
   0x0000000000aa68a4 <+4>:	movslq 0xc(%rsi),%rcx
   0x0000000000aa68a8 <+8>:	movslq 0x10(%rsi),%rdx
   0x0000000000aa68ac <+12>:	movsd  (%rdi,%rax,1),%xmm0
   0x0000000000aa68b1 <+17>:	addsd  (%rdi,%rcx,1),%xmm0
   0x0000000000aa68b6 <+22>:	movsd  %xmm0,(%rdi,%rdx,1)
   0x0000000000aa68bb <+27>:	movl   $0x5,0x8(%rdi,%rdx,1)
   0x0000000000aa68c3 <+35>:	lea    0x20(%rsi),%rax    // ret = ++opline
   0x0000000000aa68c7 <+39>:	ret

It's possibly slightly slower than using global fixed registers, but compared to using EX(opline), it's almost equivalent. In the fast path, opline is just held in either %rsi or %rax and is not spilled. It needs to be moved back and forth between the two registers when returning from and calling op handlers, but this is less expensive than loading/storing EX(opline) and uses fewer instructions. We occasionally need to preserve %rsi or %rax before function calls, but this tends to happen only in slow paths, and the save can be in another register, not necessarily on the stack.

In the case of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER, opline remains in %rsi, and is only moved to %rax before ret: lea 0x20(%rsi),%rax.

@dstogov (Member) commented Mar 3, 2025:

> Yes, it is my intuition that eliminating the load/stores of EX(opline) is what yields the improvements.

Right. I missed the store. And this should make the biggest impact on real-life performance.

> A comparison of the annotated code of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER (the hottest handler in bench.php).

These specialized instructions are rare in real-life loads. Your tests showed a relatively small improvement on Symfony. Note that finalizing may decrease this even more. On the other hand, some related optimizations may be discovered.

> It's possibly slightly slower than using global fixed registers, but compared to using EX(opline), it's almost equivalent. In the fast path, opline is just held in either %rsi or %rax and is not spilled. It needs to be moved back and forth between the two registers when returning from and calling op handlers, but this is less expensive than loading/storing EX(opline) and uses fewer instructions. We occasionally need to preserve %rsi or %rax before function calls, but this tends to happen only in slow paths, and the save can be in another register, not necessarily on the stack.

> In the case of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER, opline remains in %rsi, and is only moved to %rax before ret: lea 0x20(%rsi),%rax.

I see, and this makes sense.

@@ -1934,6 +1934,7 @@ function run_test(string $php, $file, array $env): string
$diff_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'diff';
$log_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'log';
$exp_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'exp';
$stdin_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'stdin';

@arnaud-lb (Member, Author) commented Mar 14, 2025:

Changes in this file will go into a separate PR.

Commit messages (excerpt):

  • So that execute_ex/opline are always passed via (the same) registers
  • Instead of returning -opline. This simplifies JIT as we can simply return -1 for VM_ENTER/VM_LEAVE. However, this implies that EX(opline) must be in sync.
  • This reverts commit 9aaceaf6a2480836ae59d9657d37b10bbe04268e.
  • This simplifies JIT compared to returning -opline, and doesn't require EX(opline) to be in sync.
  • Saving them in execute_ex is not safe when not assigning them to global regs, as the compiler may allocate them.
  • …U_CC: as they do not follow the _DC / _CC convention anymore
@arnaud-lb arnaud-lb changed the title Pass opline as argument to opcode handlers in CALL VM [wip] Pass opline as argument to opcode handlers in CALL VM Mar 14, 2025

@dstogov (Member) left a comment:

Sorry, I'm not able to make a deep review now.
The patch is big, but I think it may make sense.

It shows improvement without JIT.
With JIT I see a degradation (because of the useless move of the argument to %r15).
It may not be easy to eliminate this.

I don't like the ZEND_VM_ENTER_BIT magic. If reloading opline from EX(opline) on ENTER/LEAVE won't make a big difference, I would prefer to do this.

Please, see the questions.

Comment on lines -5605 to +5608
#ifdef ZEND_VM_IP_GLOBAL_REG
USE_OPLINE

SAVE_OPLINE();
#endif

@dstogov (Member):

This removes EX(opline) = opline when ZEND_VM_IP_GLOBAL_REG is defined.
This may cause output of an incorrect line number in an error message.
Maybe I'm missing something?

@arnaud-lb (Member, Author):

I will check

@arnaud-lb (Member, Author):

I think it's correct.

This causes opline to be saved to EX(opline) in the CALL VM (and in the HYBRID VM, as before). Saving used to be unnecessary, as EX(opline) was always up to date in the CALL VM. I believe the #ifdef was redundant, as SAVE_OPLINE() was a no-op in the CALL VM.
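
In other words (a sketch of the macro shape, assumed rather than copied from the patch):

/* Before this PR, conceptually: */
#ifdef ZEND_VM_IP_GLOBAL_REG
# define SAVE_OPLINE() EX(opline) = opline
#else
# define SAVE_OPLINE() /* no-op: EX(opline) was always up to date in the CALL VM */
#endif

/* With opline passed as a handler argument, EX(opline) is no longer kept
 * in sync automatically, so SAVE_OPLINE() must store in both VMs: */
#define SAVE_OPLINE() EX(opline) = opline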

Comment on lines +1954 to +1956
out($f,"# define ZEND_VM_ENTER_BIT 1ULL\n");
out($f,"# endif\n");
out($f,"# define ZEND_VM_ENTER_EX() return (zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT)\n");

@dstogov (Member):

I think we can't guarantee that the C compiler always aligns labels.

@dstogov (Member):

Can we return just some constant (e.g. NULL) and then reload opline from EX(opline)?
Or is this exactly what you are trying to optimize?

@arnaud-lb (Member, Author):

> I think we can't guarantee that the C compiler always aligns labels.

Possibly I don't understand what you mean, but opline is not a label in this case; it's a zend_op*. I think it's very unlikely that a zend_op* would be unaligned, unless some architecture has no alignment requirements (in which case we could still enforce that oplines are aligned).
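
A hypothetical compile-time guard for that last point (not part of the patch):

/* C11: fail the build if zend_op ever loses the alignment that makes
 * bit 0 of a zend_op* usable as a flag. */
_Static_assert(_Alignof(zend_op) > 1, "ZEND_VM_ENTER_BIT requires zend_op alignment > 1");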

> Can we return just some constant (e.g. NULL) and then reload opline from EX(opline)?
> Or is this exactly what you are trying to optimize?

Yes, this is part of the optimization, and it also simplifies the JIT (see #17952 (comment)).

Comment on lines +226 to +232
# ifdef ZEND_HIGH_HALF_KERNEL
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1))
# define ZEND_VM_RETURN_VAL 0
# else
# define ZEND_VM_ENTER_BIT 1ULL
# define ZEND_VM_RETURN_VAL ZEND_VM_ENTER_BIT
# endif

@dstogov (Member):

Why do we redefine ZEND_VM_ENTER_BIT here?

@arnaud-lb (Member, Author):

zend_jit_internal.h has its own definition of some VM macros.

Here I define ZEND_VM_ENTER_BIT twice because of #17952 (comment). The non-ZEND_HIGH_HALF_KERNEL case is faster and more portable, so I'm going to keep only #define ZEND_VM_ENTER_BIT 1ULL.

@dstogov (Member):

It's also possible to "teach" IR to test the high bit by checking the sign (I can do this).

@arnaud-lb (Member, Author):

If you do it, I would be happy to check how it affects performance.

@dstogov (Member) commented Mar 17, 2025:

The JIT leave path also became more expensive:

@@ -1287,8 +1295,11 @@
        jne JIT$$leave_function_handler
        movq %r14, EG(vm_stack_top)
        movq 0x30(%r14), %r14
-       addq $0x20, (%r14)
-       movl $2, %eax
+       movq (%r14), %r15
+       addq $0x20, %r15
+       movq %r15, %rax
+       movabsq $9223372036854775808, %rcx
+       orq %rcx, %rax
        addq $0x38, %rsp
        popq %r15
        popq %r14

@dstogov (Member) commented Mar 17, 2025:

BTW: I was talking about a degradation in terms of callgrind instructions, which is not the same as real time.
The real performance slightly improved. Anyway, the effect on JIT needs to be analysed.

@@ -1064,6 +1064,7 @@ const char *ir_reg_name(int8_t reg, ir_type type)
_(SSE_CEIL) \
_(SSE_TRUNC) \
_(SSE_NEARBYINT) \
_(OR_PWR2) \

@arnaud-lb (Member, Author) commented Mar 27, 2025:

Changes in ext/opcache/jit/ir will be submitted as a proper PR on the IR repository.

Here I add an optimization so that

ir_OR(x, y)

is compiled to a single bts instruction when y is a large power of two. This avoids copying y to a register first (because or doesn't accept an imm64 op2), saving an instruction and reducing code size.

So this optimizes this:

movabsq $9223372036854775808, %rcx
orq %rcx, %rax

Into this:

btsq $63, %rax

GCC does this optimization: https://godbolt.org/z/ovEWGPr3r (but not Clang).
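
A minimal standalone reproduction of the pattern (mirrors the godbolt link; the function name is mine):

#include <stdint.h>

/* With the optimization, the OR against a large power of two compiles to
 * `btsq $63, %rax`; without it, the constant must first be materialized
 * with movabs, because `or` has no imm64 form. */
uint64_t set_enter_bit(uint64_t opline)
{
    return opline | (1ULL << 63);
}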

@dstogov (Member):

Looks fine, but please name the rule BTS. Even better, implement the similar optimization for BTR and use a common rule (e.g. BIT_OP). I would be glad to accept this for IR independently.

@arnaud-lb (Member, Author):

I pushed a PR here: dstogov/ir#111

@arnaud-lb (Member, Author) commented Mar 27, 2025:

> It shows improvement without JIT.
> With JIT I see a degradation (because of the useless move of the argument to %r15).
> It may not be easy to eliminate this.

I had not tested with JIT+valgrind before, and I confirm that this increases the number of executed instructions by ~1%.

However, we execute more instructions per cycle on average, so this is still a net 2% improvement in wall time with JIT enabled.

I found no evidence that the increase in executed instructions is attributable to anything other than the save of %r15 and the move of opline into %r15 in the prologue, and the OR-ing of opline with ZEND_VM_ENTER_BIT in zend_jit_leave_func() / zend_jit_trace_return().

Regarding %r15: I initially tried to use %rsi as the fixed register for opline, as we receive opline in that register, but IR doesn't currently support using an argument register as a fixed register (it clobbers the register during function calls). Do you have an idea of how complex it would be to support that? We could also move execute_data to %rdi instead of %r14.

It would eliminate 4 instructions in the prologue (2 saves and 2 moves). The generated code would occasionally have to save %rdi/%rsi when calling other functions, but we tend to do that in slow paths anyway.

Regarding zend_jit_leave_func() / zend_jit_trace_return():

I pushed an optimization in 0bf80cf that removes the movabsq $9223372036854775808, %rcx:

@@ -1287,8 +1295,11 @@
        jne JIT$$leave_function_handler
        movq %r14, EG(vm_stack_top)
        movq 0x30(%r14), %r14
-       addq $0x20, (%r14)
-       movl $2, %eax
+       movq (%r14), %r15
+       addq $0x20, %r15
+       movq %r15, %rax
+       btsq $63, %rax
        addq $0x38, %rsp
        popq %r15
        popq %r14

The increase is not that bad, because it actually removes at least two memory operations:

@@ -1287,8 +1295,11 @@
// zend_jit_leave_func()
        jne JIT$$leave_function_handler
        movq %r14, EG(vm_stack_top)
        movq 0x30(%r14), %r14
- // EX(opline)++
-       addq $0x20, (%r14)

+ // Two instructions, but only one memory read instead of a read-write:
+ // opline = EX(opline)
+ // opline++
+       movq (%r14), %r15
+       addq $0x20, %r15

// zend_jit_trace_return()


- // return 2
-       movl $2, %eax

+ // Two instructions instead of one, but we avoid loading `EX(opline)` in the caller:
+ // return opline | ZEND_VM_ENTER_BIT
+       movq %r15, %rax
+       btsq $63, %rax

        addq $0x38, %rsp
        popq %r15
        popq %r14

Even if the memory operations are cached, the new code will execute in fewer cycles.

> I don't like the ZEND_VM_ENTER_BIT magic. If reloading opline from EX(opline) on ENTER/LEAVE won't make a big difference, I would prefer to do this.

I tried, but this makes the JIT more complex and this may cancel all benefits as well.

In the base branch, we have two cases:

  • HYBRID VM: We propagate opline primarily via %r15, and occasionally update EX(opline)
  • CALL VM: We propagate opline exclusively via EX(opline)

In the current branch, the cases where we need to update EX(opline) are the same in both VMs, which simplifies JIT.

The ZEND_VM_ENTER_BIT flag is very convenient because it allows us to differentiate 3 op handler return cases, while still propagating opline via the return value:

  • ZEND_VM_CONTINUE: return opline
  • ZEND_VM_ENTER/LEAVE: return opline | ZEND_VM_ENTER_BIT
  • ZEND_VM_RETURN: return 0 | ZEND_VM_ENTER_BIT

This avoids propagating opline via EX(opline), which is faster, but also simpler. If we removed the ZEND_VM_ENTER_BIT flag, we would need to add more EX(opline) loads/stores in the JIT'ed code again (and knowing when that's necessary can be tricky).
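
Put together, the dispatch loop under this scheme looks roughly like this (a sketch using the alignment variant, where ZEND_VM_ENTER_BIT == 1):

while (1) {
    opline = opline->handler(execute_data, opline);
    if (UNEXPECTED((uintptr_t)opline & ZEND_VM_ENTER_BIT)) {
        opline = (const zend_op*)((uintptr_t)opline & ~ZEND_VM_ENTER_BIT);
        if (opline == NULL) {
            return; /* ZEND_VM_RETURN */
        }
        execute_data = EG(current_execute_data); /* ZEND_VM_ENTER / ZEND_VM_LEAVE */
    }
}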

@dstogov (Member) left a comment:

I see minor JIT optimization opportunities.
The main idea looks OK, and I see an improvement on the x86_64 CLANG build.
Performance of the AArch64 CLANG build must be checked.

I don't object to this.
It would be great if @iluuu1994 and @nielsdos could also take a look.

# ifdef ZEND_HIGH_HALF_KERNEL
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1))
# else
# define ZEND_VM_ENTER_BIT 1ULL

@dstogov (Member):

This requires alignment of function addresses. I'm afraid this may be violated by "-Os".

@arnaud-lb (Member, Author):

This is only used on zend_op* addresses (opline, not opline->handler), so it should be OK.

-ref = jit_FP(jit);
-ref = ir_CALL_1(IR_VOID, ir_CONST_FC_FUNC(handler), ref);
+ir_ref ip = ir_CALL_2(IR_ADDR, ir_CONST_FC_FUNC(handler), jit_FP(jit), jit_IP(jit));
+jit_LOAD_IP(jit, ip);

@dstogov (Member):

This load is not always necessary.

@dstogov (Member):

For example

	callq ZEND_BIND_STATIC_SPEC_CV_HANDLER
	movq %rax, %r15   ; one of the moves is useless (we may move %rax to %rsi)
	movq %r14, %rdi
	movq %r15, %rsi
	callq ZEND_BIND_STATIC_SPEC_CV_HANDLER

Comment on lines 914 to 917
 static void jit_LOAD_IP(zend_jit_ctx *jit, ir_ref ref)
 {
-	if (GCC_GLOBAL_REGS) {
-		jit_STORE_IP(jit, ref);
-	} else {
-		ir_STORE(jit_EX(opline), ref);
-	}
+	jit_STORE_IP(jit, ref);
 }

@dstogov (Member):

After the optimization, jit_LOAD_IP() becomes useless. It's better to remove it and use jit_STORE_IP().

Comment on lines +11170 to +11171
jit_STORE_IP(jit, ir_LOAD_A(jit_EX(opline)));
jit_STORE_IP(jit, ir_ADD_OFFSET(jit_IP(jit), sizeof(zend_op)));

@dstogov (Member):

These stores are not necessary for CALL VM.

	movq (%r14), %r15      => movq (%r14), %rax
	addq $0x20, %r15       => addq $0x20, %rax
	movq %r15, %rax        => nop
	btsq $0x3f, %rax         
	addq $0x38, %rsp
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbp
	popq %rbx
	retq

@arnaud-lb (Member, Author):

I looked into this, but it's difficult to avoid: the store is in zend_jit_leave_func(), but the movq %r15, %rax; btsq $0x3f, %rax comes from zend_jit_trace_return().

Do you have a suggestion for how to address this?

If that's ok for you, I would like to merge this PR (waiting a few days for other reviews) and move to the next step of #17849 (tail calling), as this may change a few things again.

After that I will come back to this comment as well as #17952 (comment).

@arnaud-lb arnaud-lb changed the title [wip] Pass opline as argument to opcode handlers in CALL VM Pass opline as argument to opcode handlers in CALL VM Apr 1, 2025

@arnaud-lb (Member, Author):

Benchmarked on macOS with an M1 CPU: bench.php is 18% faster without JIT, and 0% faster with JIT. I expect these numbers to increase with the tail call and calling convention optimizations.

Also checked the effect of the ZEND_HIGH_HALF_KERNEL variant: It makes no difference on M1, and is a bit slower on x86_64. Therefore, I'm going to remove the ZEND_HIGH_HALF_KERNEL variant before eventually merging (so ZEND_VM_ENTER_BIT will always be 1).

@arnaud-lb arnaud-lb mentioned this pull request Apr 2, 2025

@nielsdos (Member) left a comment:

Only nits, looks really good.
I see great performance improvements in both bench.php and micro_bench.php on Clang.

/* The kernel reserves the higher part of the address space for itself.
* Therefore, we can assume that the higher bit of user space addresses is
* never set. */
# define ZEND_HIGH_HALF_KERNEL

@nielsdos (Member):

So I didn't like this higher-half stuff because it adds more complexity, but I'm glad to see it doesn't yield an improvement, so this can be simplified.

* of saving ZREG_FP, ZREG_IP when GCC_GLOBAL_REGS is 1, so we don't
* have to save them. When GCC_GLOBAL_REGS is 1, always save them.
*/
#if GCC_GLOBAL_REGS

@nielsdos (Member):

Nit: This didn't actually change, but it adds a bit of noise to the diff (also, the indent of the commented code is now off).
