
Concerns about amount of allocated opcode space #47

Open
arichardson opened this issue Sep 19, 2024 · 5 comments

Comments

@arichardson

In general I think this extension makes a lot of sense, but I am slightly concerned about how much opcode space is being used here.

While I see that just using the "double-word" encoding makes a lot of sense from a simplicity point of view, it burns a lot of opcode space: do we really need a 12-bit immediate for the offset?
Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
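To make the trade-off concrete, here is a quick sketch (my own illustration, not from the spec) of the offset reach of the different immediate choices being discussed:

```python
# Sketch: address reach of immediate encodings for a 64-bit load/store
# (Zilsd-style). Illustrative only; the field widths other than the
# 12-bit one are hypothetical alternatives, not part of the spec.

def signed_range(bits, scale=1):
    """Offset range of a two's-complement immediate scaled by `scale`."""
    lo = -(1 << (bits - 1)) * scale
    hi = ((1 << (bits - 1)) - 1) * scale
    return lo, hi

# Current proposal: 12-bit unscaled immediate (matches base ld/sd).
print(signed_range(12))           # (-2048, 2047)
# Same 12 bits, scaled by the access size (8): 8x the reach.
print(signed_range(12, scale=8))  # (-16384, 16376)
# A 5-bit immediate scaled by 8, as suggested above.
print(signed_range(5, scale=8))   # (-128, 120)
```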

Do you have any data showing which immediate values are being used when building some larger projects? Inside loops I'd imagine this to be a very small offset since the base register would be modified and for stack loads/stores the most common offsets would also be quite small (and there is push/pop which replaces lots of the ldp/stp you see in AArch64 function prologs/epilogs).

I am also not sure this extension needs compressed opcodes - is it really that common? I imagine you have a compiler prototype that can show how often it is being used?
For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf which seems quite a large impact for what I would expect to be a rather small code size improvement.

@christian-herber-nxp
Collaborator

Let me start by stating two observations:

  • This is an optional extension; the encoding could still be used by other extensions.
  • It is not consuming the entire encoding space of ld/sd, but just half of it, because of the limitation to only use even register operands.

Let me address your questions:

  1. do we really need a 12-bit immediate for the offset?
  • You could ask the same question for any other load/store instruction. A 12-bit immediate is e.g. in line with Armv7-M.
  • Choosing a different immediate length for different load/store instructions would complicate decode.
  2. Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
  • This is all in line with the existing load/store instructions. Compressed encodings have scaled immediates; 32b encodings don't.
  3. For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems quite a large impact for what I would expect to be a rather small code size improvement.
  • I would argue that this extension inherently has more benefit than Zcf and Zcd, as it is helpful for double-precision floating-point loads (with Zdinx) and any other 64b data structure.
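The even-register restriction in the second observation can be sketched as follows (my own illustrative model, not a real decoder; it only shows why odd rd values are unused and hence why only half the ld/sd encoding space is consumed):

```python
# Sketch of the even-register-pair rule: a Zilsd-style ld writes the
# register pair (rd, rd+1), so rd must be even. Every odd rd encoding
# is therefore unused, leaving half of the ld/sd space free.

def zilsd_dest_pair(rd):
    """Return the (low, high) destination registers for a pair load."""
    if rd % 2 != 0:
        raise ValueError(f"x{rd}: an even register operand is required")
    return (rd, rd + 1)

print(zilsd_dest_pair(8))  # (8, 9) -> low word in x8, high word in x9
```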

Here are some results on code size and "performance" using the embench benchmark and a prototype compiler:

[image: embench code size and performance results]

Results will improve as the compiler matures (you can see some clearly bad usages of Zilsd and missed opportunities).
Of course, a big use case will be in DSP kernels using the P extension. Very significant results can be expected there.

@arichardson
Author

> Let me start by stating two observations:
>
>   • This is an optional extension; the encoding could still be used by other extensions.
>   • It is not consuming the entire encoding space of ld/sd, but just half of it, because of the limitation to only use even register operands.
>
> Let me address your questions:
>
>   1. do we really need a 12-bit immediate for the offset?
>   • You could ask the same question for any other load/store instruction. A 12-bit immediate is e.g. in line with Armv7-M.
>   • Choosing a different immediate length for different load/store instructions would complicate decode.
>   2. Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
>   • This is all in line with the existing load/store instructions. Compressed encodings have scaled immediates; 32b encodings don't.

I agree this is consistent with the existing instructions, but this new instruction doesn't necessarily need to follow the same inefficient encoding. I'd imagine a 5-bit scaled immediate would be sufficient for the majority of cases? Can you run objdump on the code generated by your prototype compiler to create a histogram of the used immediates?
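The histogram asked for here could be collected with a small script along these lines (a sketch of my own; the mnemonic set and the line format are assumptions about GNU objdump disassembly output such as `ld a4,16(sp)`):

```python
import re
from collections import Counter

# Count the immediate offsets used by ld/sd in `objdump -d` output.
# Assumes lines containing operands of the form "<reg>,<imm>(<base>)".
OFFSET_RE = re.compile(r'\b(ld|sd)\s+\w+,(-?\d+)\((\w+)\)')

def immediate_histogram(disasm_lines):
    """Return a Counter mapping immediate offset -> occurrence count."""
    counts = Counter()
    for line in disasm_lines:
        m = OFFSET_RE.search(line)
        if m:
            counts[int(m.group(2))] += 1
    return counts

# Tiny illustrative sample (hypothetical disassembly lines):
sample = [
    "   10074:  6398      ld   a4,0(a5)",
    "   10078:  e822      sd   s0,16(sp)",
    "   1007c:  6522      ld   a0,16(sp)",
]
print(immediate_histogram(sample))  # Counter({16: 2, 0: 1})
```

Piping real `objdump -d` output through this over a large binary would show directly how much of the 12-bit range is exercised.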

>   3. For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems quite a large impact for what I would expect to be a rather small code size improvement.
>   • I would argue that this extension inherently has more benefit than Zcf and Zcd, as it is helpful for double-precision floating-point loads (with Zdinx) and any other 64b data structure.

> Here are some results on code size and "performance" using the embench benchmark and a prototype compiler:
>
> [image: embench code size and performance results]
>
> Results will improve as the compiler matures (you can see some clearly bad usages of Zilsd and missed opportunities). Of course, a big use case will be in DSP kernels using the P extension. Very significant results can be expected there.

These numbers look good for some of those benchmarks, but I'd like to know what ISA string this was built with. Since you use the entire Zcf opcode space this extension is incompatible with Zcmp and I would imagine push/pop has a larger impact on this benchmark overall?

I am also very surprised by the cubic numbers - looking at the code, it only performs floating-point operations - I assume you were building for soft-float?

@christian-herber-nxp
Collaborator

> These numbers look good for some of those benchmarks, but I'd like to know what ISA string this was built with. Since you use the entire Zcf opcode space this extension is incompatible with Zcmp and I would imagine push/pop has a larger impact on this benchmark overall?
>
> I am also very surprised by the cubic numbers - looking at the code, it only performs floating-point operations - I assume you were building for soft-float?

I should have shared the ISA string. The baseline is
`rv32im_zce_zdinx -mabi=ilp32 -mno-strict-align -Oz`

The benchmarks with very high code size reduction are those that have a high exposure to double.
Note that Zce includes Zcmp. Without Zcmp, results would generally be 3% better.

I do not have statistics for the immediate distribution, but embench would clearly not be representative here (in fact, most benchmarks are simply too small to have a realistic immediate distribution).

Please also understand that the specification is currently in the final phases of architecture review. I am happy to get questions and input in all phases, but it is best to provide it during the internal review period, which is now long past.

@arichardson
Author

I realize my feedback is most likely too late - I only happened to see the email about ratification of this extension last week so apologies for that.

Looking at the code generation for these benchmarks I don't see any large immediates being used and only really see benefits for Zdinx. Based on this data I don't see a rationale for the full 12-bit immediate.

I would like to point out that I do believe this extension is helpful for code size; I just don't think it needs a full 12-bit immediate. And once a full opcode has been allocated it becomes difficult to reclaim this space. Do you have any data for programs that are larger than just one small benchmark file? I imagine it helps on your internal programs - knowing how much benefit you get on some internal benchmarks would also be helpful.

I will try building your toolchain and run this on a larger program.

@christian-herber-nxp
Collaborator

> I realize my feedback is most likely too late - I only happened to see the email about ratification of this extension last week so apologies for that.
>
> Looking at the code generation for these benchmarks I don't see any large immediates being used and only really see benefits for Zdinx. Based on this data I don't see a rationale for the full 12-bit immediate.

Embench is far too small a benchmark to make use of the full immediate value range. The same is true for any other load/store in any ISA; still, 12 bits is the norm.
Also, while Zilsd might not always have code size benefits, it will have performance benefits by merging two compressed instructions.
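The merging point can be put in back-of-envelope terms (my own arithmetic, assuming the two original loads were both expressible in compressed 16-bit form):

```python
# Replacing two compressed 32-bit loads (c.lw, 2 bytes each) with one
# 32-bit Zilsd ld keeps code size flat but halves the instruction
# count, which is the performance benefit described above.

C_LW_BYTES = 2  # 16-bit compressed encoding
LD_BYTES   = 4  # 32-bit Zilsd encoding

before = {"bytes": 2 * C_LW_BYTES, "insns": 2}
after  = {"bytes": LD_BYTES,       "insns": 1}
print(before, after)  # same bytes, half the instructions
```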

> I would like to point out that I do believe this extension is helpful for code size; I just don't think it needs a full 12-bit immediate. And once a full opcode has been allocated it becomes difficult to reclaim this space. Do you have any data for programs that are larger than just one small benchmark file? I imagine it helps on your internal programs - knowing how much benefit you get on some internal benchmarks would also be helpful.
>
> I will try building your toolchain and run this on a larger program.

That would be great.
