Macro-op fusion as an intentional instruction set design choice

Post by **HubertLamontagne** » 2021-04-12, 18:27:55

Armv9 (basically adding the Scalable Vector Extension, SVE, to the main instruction set) has an interesting new instruction called MOVPRFX:

https://developer.arm.com/documentation ... --?lang=en

The problem they had is that, with fused multiply-accumulate instructions so tightly integrated, 4-register forms would eat up too much instruction encoding space. So instead, FMLA is a 3-register instruction (using the destination register as the second input to the addition), but it can be preceded by a MOVPRFX instruction. MOVPRFX must have the same destination register as the FMLA that follows it.

The MOVPRFX instruction can either be executed as a regular vector MOV instruction, or the CPU front-end can use macro-op fusion to combine MOVPRFX and FMLA into a single 4-register fused instruction. So you kind of get the best of both worlds between 3-register FMLA (most FMLAs only really need to be 3-operand afaik, and slower CPU cores can use the 2-step version) and 4-register FMLA (you don't have to waste a micro-op if you need a different destination register).
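To make the two execution options concrete, here is a minimal Python sketch of a decoder that either fuses the pair or leaves it as two steps. Everything in it is a toy model for illustration: the `Instr` and `MicroOp` types, the `FMLA3`/`FMLA4` micro-op names, and the `decode` function are hypothetical, and predication is ignored. The sketch also refuses to fuse when the FMLA reads the destination as a multiplicand, since in this model the fused and unfused forms would then give different results.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    op: str                      # "MOVPRFX", "FMLA", or any other mnemonic
    dst: str                     # destination register name
    srcs: Tuple[str, ...] = ()   # MOVPRFX: (source,); FMLA: (mul_a, mul_b)


@dataclass(frozen=True)
class MicroOp:
    op: str
    dst: str
    srcs: Tuple[str, ...]        # FMLA3/FMLA4: (addend, mul_a, mul_b)


def decode(stream: List[Instr], fuse: bool) -> List[MicroOp]:
    """Toy front-end: optionally fuse MOVPRFX + FMLA into one 4-register op."""
    out: List[MicroOp] = []
    i = 0
    while i < len(stream):
        cur = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        can_fuse = (
            fuse
            and cur.op == "MOVPRFX"
            and nxt is not None
            and nxt.op == "FMLA"
            and nxt.dst == cur.dst          # same destination, as required
            and cur.dst not in nxt.srcs     # multiplicands must not read dst
        )
        if can_fuse:
            # Addend comes straight from the MOVPRFX source register.
            out.append(MicroOp("FMLA4", nxt.dst, (cur.srcs[0],) + nxt.srcs))
            i += 2
        elif cur.op == "FMLA":
            # Destructive 3-register form: dst is also the addend.
            out.append(MicroOp("FMLA3", cur.dst, (cur.dst,) + cur.srcs))
            i += 1
        elif cur.op == "MOVPRFX":
            # Unfused: behaves like a plain vector move.
            out.append(MicroOp("MOV", cur.dst, cur.srcs))
            i += 1
        else:
            out.append(MicroOp(cur.op, cur.dst, cur.srcs))
            i += 1
    return out


program = [
    Instr("MOVPRFX", "z0", ("z3",)),
    Instr("FMLA", "z0", ("z1", "z2")),   # z0 = z0 + z1 * z2
]
fused = decode(program, fuse=True)       # one micro-op
unfused = decode(program, fuse=False)    # two micro-ops
```

With `fuse=True` the pair becomes one micro-op that reads `z3`, `z1`, `z2` and writes `z0`; with `fuse=False` the same code runs as a move followed by a destructive FMLA.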

So I was wondering, could this kind of design be useful for ForwardCom?

Post by **agner** » 2021-04-13, 5:28:33

The x86 instruction set introduced prefixes long ago. Today, there are many different prefixes that are 1, 2, 3, and 4 bytes long. There is no limit to how many prefixes an x86 instruction can have as long as the complete instruction is no more than 15 bytes long. This is a nightmare to decode. The decoder is a serious bottleneck in x86, and this is the reason why they need a micro-op cache after the decoder. The RISC principle was invented to make decoding easier by having a fixed instruction length. Now, it looks like ARM is going in the same direction as x86, with prefixes to patch the problems of a limiting design.

ForwardCom avoids this problem by designing a flexible instruction format right from the beginning. ForwardCom defines instructions that can be one, two, or three 32-bit words long. The length is defined by two bits in the first word so that decoding becomes simple. There are different instruction formats with 2, 3, and 4 registers, immediate constants, and memory operands with different addressing modes. Most common instructions can use any of these formats. Two-operand instructions, e.g. addition, can be coded in a 2-register format where the destination overwrites the first source operand, or in a non-destructive 3-register format. Three-operand instructions, such as fused multiply and add, can be coded in a short 3-register format, or in a non-destructive 4-register format.
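As a rough illustration of why a length field in the first word keeps decoding simple, here is a minimal Python sketch that splits a stream of 32-bit words into instructions. The bit position (the top two bits) and the mapping of the four codes onto the three lengths are assumptions made for this example, and `instruction_length_words` and `split_instructions` are hypothetical helper names.

```python
from typing import List

# Assumed mapping of the 2-bit length code to a length in 32-bit words.
# Two bits give four codes for three lengths, so one length gets two codes.
LENGTH_BY_CODE = {0: 1, 1: 1, 2: 2, 3: 3}


def instruction_length_words(first_word: int) -> int:
    """Length in words, read from two bits of the first word only."""
    code = (first_word >> 30) & 0b11   # assumption: bits 31..30
    return LENGTH_BY_CODE[code]


def split_instructions(words: List[int]) -> List[List[int]]:
    """Cut a word stream into instructions without inspecting later words."""
    out: List[List[int]] = []
    i = 0
    while i < len(words):
        n = instruction_length_words(words[i])
        out.append(words[i:i + n])
        i += n
    return out


stream = [
    0x00000001,                          # 1-word instruction
    0x80000002, 0x0000AAAA,              # 2-word instruction
    0xC0000003, 0x00000001, 0x00000002,  # 3-word instruction
    0x40000004,                          # 1-word instruction
]
instructions = split_instructions(stream)
```

The point of the sketch is that the boundary of every instruction is known after looking at a single fixed field, in contrast to scanning a variable number of prefixes first.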

The ForwardCom design is limited to three input operands and a predication mask. A design that allows three input operands opens possibilities for several other useful instructions. Many other instruction sets were originally designed only for 2-input instructions, so patches had to be added later to allow for 3-input instructions. Allowing still more input operands would have high hardware costs because the pipeline needs space to propagate all operands (of 64 bits each) through all stages of the pipeline. Non-destructive formats with a separate destination register add only a small extra hardware cost.

Now that I am working on a soft core, I find that it is fairly cheap to have many different instruction formats. You need only a few bits to specify where each operand is, and the operands have been sorted out before they reach the critical ALU stage.

Post by **HubertLamontagne** » 2021-04-13, 20:48:42

I imagine that ARM specifically decided to make 4-operand operations into a fusable two-instruction sequence in order to avoid introducing 64-bit opcodes in ARM64 when everything else is 32-bit only, and to have an escape hatch in case nobody used SVE (or if they have so many register file write ports that fusing the instructions doesn't gain anything).

In your case, you're already fully eating the cost of mixing in 64-bit opcodes, so I guess that intentionally designing in macro-op fusion doesn't make any sense, indeed. And it's good to know that the wide variety of input formats isn't a problem.

Post by **HubertLamontagne** » 2021-06-24, 16:06:03

One extra question here... Not to be too inquisitive, but I was wondering what exactly your motivation is for doing LOAD+MATH in a single instruction instead of a LOAD and MATH instruction sequence on ForwardCom. Is it:

- Is the goal to build in-order CPU pipelines like the 1st generation Intel Atom, where MATH operations run 2 cycles delayed in the pipeline, so that it can be part of a LOAD+MATH sequence without having to introduce stalls?

- For out-of-order CPUs, the LOAD+MATH operation has to be split into 2 micro-ops anyway (LOAD, MATH). The one difference from the two-instruction sequence is that the inner value from the load operation never gets exposed to the instruction sequence if it's a single LOAD+MATH instruction. So the renamed virtual register used for the load operation doesn't have to be part of the main register file (it can go to a specialized register file for LOAD results), and the CPU retirement/graduation logic never really has to deal with it, which means that your main register file needs one less write port (afaik, this is what "micro-op fusion" does on Intel CPUs). RISC CPUs like ARM64 ones need to have more write ports on their register file because of this. Is this the main motivation for LOAD+MATH operations on ForwardCom?

- Is your goal to have the same number of micro-ops execute in the out-of-order back-end regardless of whether it's LOAD+MATH or LOAD then MATH, and you judge that it's easier to build a front-end that decodes fewer, more complex 2-micro-op instructions than a front-end that decodes more, simpler 1-micro-op instructions?

Post by **agner** » 2021-06-25, 5:11:33

Hubert, my main motivation for making load+alu instructions is to do more work per instruction = higher throughput. The x86 instruction set is quite efficient despite a terribly complicated decode process, exactly because it does more work per instruction. It also reduces register pressure.

Another reason for having load+alu instructions is to make the instruction set orthogonal. The same instruction can have register operands, immediate constant operands, or memory operands. (No more than one memory operand). Most other instruction sets have limitations on constant operands. For example, it is common in x86 and other instruction sets to use a 32-bit address operand to load a 32-bit floating point constant from memory. This is a waste of data cache. ForwardCom lets you use the 32-bit extra space in the instruction for the floating point constant rather than for a memory address where the constant is stored.
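A small Python sketch of the constant example: a 32-bit float fits exactly in the 32 bits that would otherwise hold an address, so nothing needs to be fetched through the data cache. The helper names `f32_bits`, `bits_f32`, and `constant_cost` are hypothetical, and the byte accounting is a simplified assumption for illustration, not a measurement of any real encoding.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Inverse of f32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def constant_cost(via_memory: bool) -> dict:
    """Simplified accounting for one 32-bit float constant operand."""
    return {
        "extra_instruction_bytes": 4,             # address or immediate
        "data_bytes": 4 if via_memory else 0,     # constant stored in data
        "data_cache_accesses": 1 if via_memory else 0,
    }


immediate = f32_bits(1.5)          # value placed directly in the instruction
recovered = bits_f32(immediate)    # what the execution unit would see
by_address = constant_cost(via_memory=True)
by_immediate = constant_cost(via_memory=False)
```

Under these assumptions both variants spend the same 4 extra bytes in the instruction, but only the address-based one also occupies data memory and touches the data cache.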

It is not necessary to split a load + alu instruction into two micro-ops. AMD has never done this AFAIK, and Intel has gone back to using a single micro-op.

A drawback of the load+alu design is that the pipeline gets longer, which is bad for mispredicted branches. I have an idea of making a superscalar cpu with a short pipeline for instructions without memory operands, and a longer pipeline for instructions with memory operands.

forwardcom forum