Possible difficulties for microcode-less implementations

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Possible difficulties for microcode-less implementations

Post by Kulasko »

I recently had some difficulty while planning a microcode-less CPU design based on ForwardCom. Specifically, I am having some trouble with the restore_cps instruction.
The restore_cps instruction is used for efficiently loading a vector from memory. The vector is saved in a compressed image of variable length. Herein lies the problem: in order to read the vector, the instruction has to know its length. The length is stored in the compressed image, so the instruction has to read the length before reading the rest of the vector, meaning two dependent memory loads. I found three ways to make it work without explicit microcode:
  1. Supporting it as a read mode in the load/store unit. This wouldn't explicitly require microcode, but it would essentially make the LS unit microcoded.
  2. Not compressing the vector, making the image size constant.
  3. Designing a pipeline where two reads are possible, just for this instruction.
None of these options is ideal, but the instruction is needed, as it plays an important role in efficiently saving and restoring registers. On the other hand, the ability to build simple, microcode-less implementations is desirable.
Does anyone have an idea how to implement this functionality without microcode?
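To make the dependency concrete, here is a minimal behavioral sketch of the two-load problem (assuming, purely for illustration, an image layout of a one-byte length field followed by the data; this is not the exact ForwardCom encoding):

```python
# Behavioral sketch of the dependent two-load problem for a
# restore-style instruction. The image layout (a 1-byte length field
# followed by the vector data) is a simplified assumption.

def restore_vector(mem: bytes, addr: int) -> bytes:
    # Load #1: read the length field at the start of the image.
    length = mem[addr]
    # Load #2: read the vector data. Its size depends on load #1,
    # so the two accesses cannot be issued independently.
    data = mem[addr + 1 : addr + 1 + length]
    return data

image = bytes([3, 10, 20, 30, 99, 99])  # length = 3, then 3 data bytes, then junk
print(restore_vector(image, 0))         # -> b'\n\x14\x1e'
```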
agner
Site Admin
Posts: 178
Joined: 2017-10-15, 8:07:27
Contact:

Re: Possible difficulties for microcode-less implementations

Post by agner »

Thank you for looking at this.

This instruction is so important that it is worth some extra hardware. I am imagining a state machine for this instruction. You don't have to use any advanced data compression algorithm, just adjust the length so that unused register bits are not saved/restored.
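A behavioral sketch of such a state machine might look like this (the two-state structure and the one-byte length field are illustrative assumptions, not the ForwardCom encoding):

```python
# Sketch of a hard-wired state machine for the restore instruction:
# one state reads the length field, the next reads exactly that many
# bytes. Each loop iteration stands in for one machine cycle.

def restore_fsm(mem: bytes, addr: int) -> bytes:
    state, length, data = "READ_LEN", 0, b""
    while state != "DONE":
        if state == "READ_LEN":
            length = mem[addr]                        # cycle 1: length field
            state = "READ_DATA"
        elif state == "READ_DATA":
            data = mem[addr + 1 : addr + 1 + length]  # cycle 2: vector data
            state = "DONE"
    return data
```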
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Possible difficulties for microcode-less implementations

Post by HubertLamontagne »

I'd imagine this would require special handling in the load/store unit if the vector isn't constant-sized. Something where you get a kind of double-slot read micro-op or 2 micro-ops (there need to be 2 reads if the image straddles a cache line boundary anyways), and if there's a potential read fault on the 2nd slot, it checks the data in the 1st slot to see if it really faults or not?

Alternatively, you could maybe have something like a vector load size predictor, that tries to guess how many cache lines to load and emits the right number of reads/micro-ops... and if something goes wrong (presumably it underestimated the load size or it faulted), it automatically does a L1 cache miss and it lets the cache refill unit deal with the fallout. Does this make sense?
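As a rough illustration of the predictor idea (the per-PC table and the last-value training policy are assumptions for this sketch, not anything specified by ForwardCom):

```python
# Sketch of a vector load size predictor: guess how many cache lines a
# variable-length vector load will touch, then train on the real count.

class LoadSizePredictor:
    def __init__(self):
        self.table = {}                 # pc -> last observed cache-line count

    def predict(self, pc):
        return self.table.get(pc, 1)    # default guess: one cache line

    def update(self, pc, actual_lines):
        # Underestimating means too few reads were emitted: a misprediction.
        ok = self.predict(pc) >= actual_lines
        self.table[pc] = actual_lines   # simple last-value training
        return ok
```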
Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Possible difficulties for microcode-less implementations

Post by Kulasko »

I'll use a load unit that emits two loads for now.
A load size predictor would certainly make sense for more elaborate designs. The naive way of first loading the size and then loading the data could cause a fairly long instruction latency, but I will go with this for now for simplicity's sake.

I found another problematic instruction: call, especially the version with a memory operand.
For a dual stack implementation, handling a spilling call stack would be fairly complex and might warrant microcode in this special case anyway (or a trap using the read_call_stack and write_call_stack system instructions).

For a unified stack implementation, the instruction consists of a jump as well as a push to the stack. For the call version with a memory operand, the instruction has:
- a memory load (jump base address)
- a memory store (return address)
- a register output (new stack pointer address)
The return is actually non-problematic since it consists of a memory read, a register output and a jump.

I understand the importance of call instructions, especially for the dual stack design, since they are the only way to manipulate the call stack in user mode. However, since the call with a memory operand is non-optional, is there a way to fit it into a typical pipeline? Or, if it warrants microcode because of its usefulness, wouldn't it be reasonable to also add push and pop instructions, since they could reuse a fair bit of the call/return functionality?
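For illustration, the load/store/register breakdown above could be expressed as a µop list (the µop names, their ordering, and the 8-byte return-address slot are assumptions for this sketch, not a ForwardCom specification):

```python
# Sketch of how call-with-memory-operand might decompose into µops on
# a unified-stack design, matching the three parts listed above plus
# the jump itself.

def decode_call_mem(mem_operand: str) -> list:
    return [
        ("load",  f"tmp <- [{mem_operand}]"),  # memory load: jump base address
        ("store", "[sp - 8] <- return_addr"),  # memory store: return address
        ("alu",   "sp <- sp - 8"),             # register output: new stack pointer
        ("jump",  "pc <- tmp"),                # transfer control
    ]
```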
agner
Site Admin
Posts: 178
Joined: 2017-10-15, 8:07:27
Contact:

Re: Possible difficulties for microcode-less implementations

Post by agner »

Thanks for your analysis.

The ABI guarantees that you can read a maximum vector length beyond the actual data without going into non-available memory. The restore of a variable-length vector could take advantage of this and use a single read of the size of a length-specifier plus the maximum vector length, but this would require a data bus larger than the maximum vector length.

The dual stack implementation is preferred. A call with a memory operand (or a table) will use the on-chip call stack so it will only have a memory read. A pop register instruction would use the data stack, not the call stack, so it cannot use the same mechanism and it would violate the rule of only one output. Whether it is worthwhile to make an exception to the one-output rule for the pop instruction is an open question at the moment. Maybe the decoder could issue two µops for a pop instruction, but this would complicate the hardware pipeline.

I prefer to avoid microcode. The situations where you would use microcode are so few, that you could make a hard-wired state machine instead. At least, this is what I am imagining. I would prefer to have a state machine to handle the call stack spill situation.

Interrupts and traps should also use a state machine. In the semi out-of-order design (see chapter 8.1) I would like to have traps only in the in-order front end. Numeric faults are handled without traps, as I will discuss more later. The only problem here is jump/call with a memory operand or table. Either you would have the memory read in the out-of-order back end with a possible trap for an invalid target address, or you would have memory read in the in-order front end with a possible delay for cache misses. The latter solution is simpler.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Possible difficulties for microcode-less implementations

Post by HubertLamontagne »

Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode. ARM has many instructions that are multiple-uop (for instance, load+increment a pointer, multiple register store/load...). "Call" and "Ret" are inherently multiple-uop, since you load/store a value and you update the stack pointer, which have different latencies.

Multiple-uop operations get a bad rep because they're used to plug the gaps in modern x86 implementations, but they're not inherently bad.
Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Possible difficulties for microcode-less implementations

Post by Kulasko »

Thank you for your replies.
HubertLamontagne wrote: 2020-04-02, 17:22:53 Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode.
You are right, I didn't think about that.
I still have some difficulty dealing with this, as it complicates the decoding stage. Presumably, decoders don't have a constant latency and don't output just zero or one operation per clock, but up to n operations. There is then some logic that takes the outputs of all decoders, ensures the correct order, and arranges them so that the first m operations are pushed further along the pipeline.
Am I understanding this correctly?
agner wrote: 2020-04-02, 4:47:12 The ABI guarantees that you can read a maximum vector length beyond the actual data without going into non-available memory. The restore of a variable-length vector could take advantage of this and use a single read of the size of a length-specifier plus the maximum vector length, but this would require a data bus larger than the maximum vector length.
Oh yeah, I forgot about that part of the ABI. Also, except for the case where a vector actually contains more bytes than the maximum vector length minus the size of the length field, this could be a single read even if the data bus width only matches the maximum vector length. For the other case, it should be sufficient to let the load/store units or the cache controller split the request, as they have to do that anyway if the saved vector image crosses a cache line boundary.
To summarize, restore operations shouldn't be an issue, because it is allowed to read the maximum vector size and discard the unused part. Only for an (almost) fully populated vector might a second read be necessary.
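That decision rule fits in a few lines (the one-byte length field is an assumption for illustration):

```python
# Sketch of the read-count decision: with a bus as wide as the maximum
# vector length, one read suffices unless the image (length field plus
# data) exceeds the bus width.

def reads_needed(data_len: int, bus_bytes: int, len_field: int = 1) -> int:
    image = len_field + data_len
    return 1 if image <= bus_bytes else 2
```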
agner wrote: 2020-04-02, 4:47:12 I prefer to avoid microcode. The situations where you would use microcode are so few, that you could make a hard-wired state machine instead. At least, this is what I am imagining. I would prefer to have a state machine to handle the call stack spill situation.
I agree with the use of state machines; they will certainly help keep the main data path simple. However, they still produce some overhead for detecting an instruction that needs to use them, as well as for connecting to the other components. Detection should be a non-issue for calls, as they have to be detected anyway for branch prediction purposes.
As for connections to the main data path, such a state machine certainly needs access to the LSUs. Depending on where the spilled stack is going to be saved, either the stack pointer needs to be reserved or, if the program shouldn't be able to manipulate it in any way, the stack needs its own space in a protected memory region. This would require trapping to a privileged mode. Trapping might be a good idea for simple designs and call stack spills, because it requires no additional hardware: ForwardCom already has "read_call_stack" and "write_call_stack" instructions. Is this right or did I miss something?
agner
Site Admin
Posts: 178
Joined: 2017-10-15, 8:07:27
Contact:

Re: Possible difficulties for microcode-less implementations

Post by agner »

Kulasko wrote:
Presumably, decoders don't have a constant latency and output just zero or one operation per clock, but n operations.
If an instruction generates two µ-ops, then you can put the second one in a pending position and stall the decoder in the next clock cycle. You can do the same with a tiny instruction pair.
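A sketch of this pending-slot scheme (the instruction set here is invented; only "pop" expands to two µops, standing in for any 2-µop case):

```python
# Sketch of a decoder with a pending slot: a 2-µop instruction emits
# its first µop, parks the second, and stalls decode for one cycle.

def expand(ins):
    # Invented 2-µop case: "pop" splits into a load and an sp update.
    return [("load", ins), ("alu", ins)] if ins == "pop" else [("alu", ins)]

def decode(instrs):
    issued, pending = [], None
    it = iter(instrs)
    while True:
        if pending is not None:
            issued.append(pending)   # drain the parked µop; decoder stalled
            pending = None
            continue
        try:
            ins = next(it)
        except StopIteration:
            return issued
        uops = expand(ins)
        issued.append(uops[0])
        if len(uops) > 1:
            pending = uops[1]        # park the second µop for next cycle
```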
Depending on where the spilled stack is going to be saved, either the stack pointer needs to be reserved or, if the program shouldn't be able to manipulate it in any way, it needs its own space in a protected memory region. This would require trapping to a privileged mode. Trapping might be a good idea for simple designs and call stack spills because it requires no additional hardware, as ForwardCom already has "read_call_stack" and "write_call_stack" instructions. Is this right or did I miss something?
Yes. Let's assume that we have a separate call stack that can spill into a protected memory region. The stack overflow could either be trapped, or we could have a dedicated hardware state machine for stack spilling with access to the protected memory.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Possible difficulties for microcode-less implementations

Post by HubertLamontagne »

Kulasko wrote: 2020-04-06, 15:26:44 Thank you for your replies.
HubertLamontagne wrote: 2020-04-02, 17:22:53 Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode.
You are right, I didn't think about that.
I still have some difficulty dealing with this, as it complicates the decoding stage. Presumably, decoders don't have a constant latency and don't output just zero or one operation per clock, but up to n operations. There is then some logic that takes the outputs of all decoders, ensures the correct order, and arranges them so that the first m operations are pushed further along the pipeline.
Am I understanding this correctly?
Yes. But what is a micro-op really depends on your type of pipeline. For in-order pipelines, it doesn't really matter what is a micro-op or isn't. You just push the next instruction in the pipeline as soon as it doesn't stall, and it doesn't really matter if the instruction is split in multiple parts internally or not, since the execution of different parts of the instruction is tightly coupled. You're probably only going to run 2 instructions per cycle maximum anyways (afaik), and probably only have a maximum of 1 memory load per cycle, so your front end doesn't really have to handle the really crazy cases (the back-end is going to stall anyways).

Micro-ops really make much more sense for out-of-order pipelines:

- Memory operations and ALU operations are scheduled completely differently, so generally, any operation that contains both parts has to be split into 2 micro-ops. A lot of the smaller out-of-order processors start all the memory operations in-order (but they can complete out-of-order), and inversely, ALU operations can easily be wildly reordered. The memory load can easily run tens of cycles earlier than the math operation (if the math operation is waiting for the second operand), so the result of the memory load has to be placed in a register file of some kind anyways.

- Some x86 processors have a kind of 2-in-1 micro-op where the load and the ALU part of an instruction are still coupled together instead of being 2 totally separate micro-ops (with the AMD Athlon even having a 3-in-1 load-ALU-store micro-op). Afaik, the only difference between the 2-in-1 micro-op and 2 fully separate micro-ops is that in the 2-in-1 micro-op, the temporary inner value can be placed in a different register pool than the normal virtual registers and it doesn't have to compete for the same register retirement ports. Not all x86 CPUs have the 2-in-1 and 3-in-1 micro-ops. Out-of-order RISC CPUs don't have these 2-in-1 micro-ops either and simply have more capacity for micro-ops internally (recent ARMs can run 6 micro-ops per cycle).

- Any instruction that stores 2 registers at the same time needs to issue 2 micro-ops. This is because register file write ports tend to be allocated per micro-op (although there are special cases, such as the flags register engine on x86).

- Breaking instructions into micro-ops really helps deal with the latency of each sub-part of the operation. For instance, by splitting the RETURN instruction into a memory load, a jump, and an ALU micro-op to update the stack pointer, you can run each part with optimal latency (if the branch is correctly predicted). The stack pointer update runs really fast (since it has no dependencies except for the stack pointer) which lets the updated stack pointer be available very soon to instructions downstream. Likewise, the memory load gets started pretty fast, so that the CPU can quickly lock the load address against any intervening store operation. The jump part of the instruction can easily be delayed dozens of cycles (in the case where you get cache misses). This type of flexible scheduling is impossible to do with state machines.
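To illustrate why the split pays off, here is a toy latency calculation for the three RETURN µops (register names, latencies, and the dependency model are all illustrative assumptions):

```python
# Sketch of the RETURN split: three µops whose only coupling is the
# tmp value from the load, so the stack pointer update can complete
# long before a cache-missing load resolves.

ret_uops = [
    ("load", "tmp", ("sp",)),   # read return address from the stack
    ("alu",  "sp",  ("sp",)),   # sp <- sp + 8; depends only on sp, so it's fast
    ("jump", "pc",  ("tmp",)),  # waits for the load, possibly dozens of cycles
]

def ready_cycle(ready, latency):
    """Earliest completion cycle of each µop's result, given operand-ready times."""
    done = dict(ready)
    for op, dst, deps in ret_uops:
        done[dst] = max(done[d] for d in deps) + latency[op]
    return done

# With a 20-cycle load miss, the new sp is ready at cycle 1 while the
# jump only resolves at cycle 21.
times = ready_cycle({"sp": 0}, {"load": 20, "alu": 1, "jump": 1})
```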

It's true that this can easily add a stage or two to your pipeline, since you don't really know how many micro-ops are going to come out of a block of instructions. In particular, this means you don't know your micro-op mix: maybe you get 1 load, 1 store, 2 ALU and 2 vector ops (in which case it all goes to separate instruction queues, so you can likely issue them all together), or maybe you get 6 loads (in which case you're going to get a stall no matter what!). I guess you could add a specialized cache to deal with it, similar to how the Athlon caches instruction length.