Concerns about amount of allocated opcode space #47

Open
arichardson opened this issue Sep 19, 2024 · 3 comments

Comments

@arichardson

In general I think this extension makes a lot of sense, but I am slightly concerned about how much opcode space is being used here.

While I see that just using the "double-word" encoding makes a lot of sense from a simplicity point of view, it burns a lot of opcode space: do we really need a 12-bit immediate for the offset?
Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
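As a rough back-of-the-envelope illustration of that last point (my own sketch, nothing Zilsd-specific): only one in eight values of an unscaled 12-bit immediate is a multiple of 8.

```python
# Back-of-the-envelope sketch: of the 4096 values a signed 12-bit immediate
# can encode, only those divisible by 8 are naturally aligned for a 64-bit
# access -- 1/8 of the space, i.e. the "3 wasted bits".
imm_values = range(-2048, 2048)                      # signed 12-bit range
aligned = [v for v in imm_values if v % 8 == 0]
print(f"{len(aligned)} of {len(imm_values)} immediates are multiples of 8")   # 512 of 4096
print(f"that is {len(aligned) / len(imm_values):.1%} of the encoding space")  # 12.5%
```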

Do you have any data showing which immediate values are actually used when building some larger projects? Inside loops I'd imagine the offset to be very small since the base register would be modified, and for stack loads/stores the most common offsets would also be quite small (and there is push/pop, which replaces a lot of the ldp/stp you see in AArch64 function prologues/epilogues).

I am also not sure this extension needs compressed opcodes - is it really that common? I imagine you have a compiler prototype that can show how often it is being used?
For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems quite a large impact for what I would expect to be a rather small code size improvement.

@christian-herber-nxp
Collaborator

Let me start by stating two observations:

  • This is an optional extension; the encoding could still be used by other extensions.
  • It is not consuming the entire encoding space of ld/sd, but just half of it, because of the limitation to only use even register operands (see the short sketch after this list).

Let me address your questions:

  1. do we really need a 12-bit immediate for the offset?
     • You could ask the same question for any other load/store instruction. A 12-bit immediate is e.g. in line with Armv7-M.
     • Choosing a different immediate length for different load/store instructions would complicate decode.
  2. Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
     • This is all in line with the existing load/store instructions. Compressed encodings have scaled immediates; 32b encodings don't.
  3. For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems quite a large impact for what I would expect to be a rather small code size improvement.
     • I would argue that this extension inherently has more benefit than Zcf and Zcd, as it is helpful for double precision floating point loads (with Zdinx) and any other 64b data structure.
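To make the "just half of it" point concrete, here is a trivial sketch (my own illustration, assuming the standard 5-bit register fields and 12-bit immediate of the 32-bit load/store formats):

```python
# Illustration only: with the data register restricted to even numbers, its
# 5-bit field can take 16 of 32 values, so Zilsd claims exactly half of the
# ld/sd encoding points and leaves the odd-register half free for reuse.
DATA_REGS, BASE_REGS, IMM_VALUES = 32, 32, 1 << 12

per_insn_total = DATA_REGS * BASE_REGS * IMM_VALUES
per_insn_zilsd = (DATA_REGS // 2) * BASE_REGS * IMM_VALUES   # even regs only

print(f"encoding points per ld/sd instruction: {per_insn_total}")
print(f"claimed by Zilsd: {per_insn_zilsd} ({per_insn_zilsd / per_insn_total:.0%})")
```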

Here are some results on code size and "performance" using the embench benchmark and a prototypical compiler:

[image: Embench code size and performance results]

Results will improve as the compiler matures (you can see some clearly bad usages of Zilsd and missed opportunities).
Of course, a big use case will be in DSP kernels using the P extension. Very significant results can be expected there.

@arichardson
Author

> Let me start by stating two observations:
>
>   • This is an optional extension; the encoding could still be used by other extensions.
>   • It is not consuming the entire encoding space of ld/sd, but just half of it, because of the limitation to only use even register operands.
>
> Let me address your questions:
>
>   1. do we really need a 12-bit immediate for the offset?
>      • You could ask the same question for any other load/store instruction. A 12-bit immediate is e.g. in line with Armv7-M.
>      • Choosing a different immediate length for different load/store instructions would complicate decode.
>   2. Additionally, that immediate is unscaled even though it only really makes sense to use it for multiples of 8, wasting 3 bits of the encoding.
>      • This is all in line with the existing load/store instructions. Compressed encodings have scaled immediates; 32b encodings don't.

I agree this is consistent with the existing instructions, but this new instruction doesn't necessarily need to follow the same inefficient encoding. I'd imagine a 5-bit scaled immediate would be sufficient for the majority of cases? Can you run objdump on the code generated by your prototype compiler to create a histogram of the used immediates?
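Something along these lines would be enough (a rough sketch on my side; the disassembly format and the exact mnemonics your prototype toolchain prints are assumptions, so the regex may need tweaking):

```python
#!/usr/bin/env python3
# Rough sketch: pipe `objdump -d` output in on stdin and tally the immediate
# offsets used by ld/sd.  The mnemonic spelling and listing format are
# assumptions about the prototype toolchain; adjust the regex as needed.
import re
import sys
from collections import Counter

# Matches e.g. "ld  a0,16(sp)" or "sd  s2,-8(a1)" in a disassembly listing.
MEM_OP = re.compile(r"\b(ld|sd)\s+\S+,\s*(-?\d+)\(")

hist = Counter()
for line in sys.stdin:
    m = MEM_OP.search(line)
    if m:
        hist[int(m.group(2))] += 1

for offset, count in sorted(hist.items()):
    print(f"{offset:6d}: {count}")
```

Run as e.g. `riscv32-unknown-elf-objdump -d prog.elf | python3 imm_hist.py` (toolchain prefix and script name are just placeholders); that would immediately show whether a 5-bit scaled field (offsets 0..248 in steps of 8) covers the common cases.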

>   3. For the compressed instructions we would end up using essentially all the remaining encodings freed up by disabling Zcf, which seems quite a large impact for what I would expect to be a rather small code size improvement.
>      • I would argue that this extension inherently has more benefit than Zcf and Zcd, as it is helpful for double precision floating point loads (with Zdinx) and any other 64b data structure.

> Here are some results on code size and "performance" using the embench benchmark and a prototypical compiler:
>
> [image: Embench code size and performance results]
>
> Results will improve as the compiler matures (you can see some clearly bad usages of Zilsd and missed opportunities). Of course, a big use case will be in DSP kernels using the P extension. Very significant results can be expected there.

These numbers look good for some of those benchmarks, but I'd like to know what ISA string this was built with. Since you use the entire Zcf opcode space, this extension is incompatible with Zcmp, and I would imagine push/pop has a larger impact on this benchmark overall?

I am also very surprised by the cubic numbers - looking at the code, it only performs floating point operations - I assume you were building for soft-float?

@christian-herber-nxp
Collaborator

> These numbers look good for some of those benchmarks, but I'd like to know what ISA string this was built with. Since you use the entire Zcf opcode space, this extension is incompatible with Zcmp, and I would imagine push/pop has a larger impact on this benchmark overall?
>
> I am also very surprised by the cubic numbers - looking at the code, it only performs floating point operations - I assume you were building for soft-float?

I should have shared the ISA string. The baseline is
rv32im_zce_zdinx -mabi=ilp32 -mno-strict-align -Oz

The benchmarks with very high code size reduction are those that have a high exposure to double.
Note that Zce includes Zcmp. Without Zcmp, results would generally be 3% better.

I do not have statistics for the immediate distribution, but embench would clearly not be representative here (in fact, most benchmarks are simply too small to have a realistic immediate distribution).

Please also understand that the specification is currently in the final phases of architecture review. I am happy to get questions and input in all phases, but it is best to provide them during the internal review period, which is now long past.
