Description
In general I think this extension makes a lot of sense, but I am slightly concerned about how much opcode space is being used here.
While I see that just using the "double-word" encoding makes a lot of sense from a simplicity standpoint, it burns a lot of opcode space: do we really need a 12-bit immediate for the offset?
Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
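To put a number on that, here is a rough sketch of the arithmetic. It assumes a signed 12-bit byte offset and that only 8-byte-aligned offsets are useful, as argued above; the function names are made up for illustration and are not part of any proposal.

```python
def unscaled_stats(bits=12, align=8):
    """Range of a signed, unscaled immediate and how many values are aligned."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    aligned = sum(1 for v in range(lo, hi + 1) if v % align == 0)
    return lo, hi, aligned


def scaled_range(bits, align=8):
    """Byte range reachable by a signed immediate that is scaled by `align`."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo * align, hi * align


lo, hi, aligned = unscaled_stats()
print(f"unscaled 12-bit: {lo}..{hi}, {aligned} of {hi - lo + 1} values are multiples of 8")
print(f"scaled 9-bit:    {scaled_range(9)[0]}..{scaled_range(9)[1]} in steps of 8")
```

Under those assumptions only 512 of the 4096 immediate values are aligned, and a 9-bit immediate scaled by 8 reaches the same aligned offsets (-2048 to 2040).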
Do you have any data showing which immediate values are actually used when building some larger projects? Inside loops I'd expect the offset to be very small, since the base register would be modified. For stack loads/stores the most common offsets should also be quite small, and push/pop already replaces many of the ldp/stp pairs you see in AArch64 function prologues/epilogues.
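Something along these lines would be enough to gather that data from a disassembly dump. This is only a sketch: it assumes the paired accesses show up under the `ld`/`sd` mnemonics with an `offset(base)` operand, and the sample lines are hand-written for illustration rather than taken from a real build.

```python
import re
from collections import Counter

# Assumed operand syntax: "<mnemonic> <reg>,<offset>(<base>)".
MEM_OP = re.compile(r"\b(ld|sd)\s+\w+,\s*(-?\d+)\((\w+)\)")


def offset_histogram(disassembly):
    """Count how often each immediate offset appears in ld/sd instructions."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = MEM_OP.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Hand-written sample lines, not real compiler output.
sample = """\
ld s0,8(sp)
sd ra,0(sp)
ld s1,16(sp)
ld s0,8(sp)
addi sp,sp,32
"""

for offset, count in sorted(offset_histogram(sample).items()):
    print(f"offset {offset:5d}: {count}")
```

Running this over the disassembly of a few larger projects would show how much of the 12-bit range is used in practice.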
I am also not sure this extension needs compressed opcodes: are these instructions really that common? I imagine you have a compiler prototype that can show how often they are used?
For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems like quite a large cost for what I would expect to be a rather small code size improvement.