NAN propagation instead of fault trapping. Can we avoid speculative execution?

Post by **agner** » 2018-05-24, 11:07:49

Floating point calculations can generate infinity (INF) and not-a-number (NAN) in case of errors. These codes will propagate to the end result of a sequence of calculations in most cases. This is a convenient way of detecting floating point errors, and it is more efficient than using traps (software interrupts) for detecting numerical errors. Traps are particularly troublesome if vector registers are used.
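As a minimal sketch of the propagation described above (the function name and test data are made up for illustration): the loop below does no per-element error checking, and a single bad element still shows up in the final result. Python raises `ZeroDivisionError` where IEEE 754 hardware would silently produce INF or NAN, so the sketch maps division by zero by hand and ignores the sign of a negative zero divisor.

```python
import math

def mean_of_ratios(numerators, denominators):
    """Average of n/d with no error checks inside the loop."""
    total = 0.0
    for n, d in zip(numerators, denominators):
        if d == 0.0:
            # Emulate the hardware result: 0/0 gives NAN, x/0 gives signed INF.
            q = math.nan if n == 0.0 else math.copysign(math.inf, n)
        else:
            q = n / d
        total += q  # INF and NAN pass through the additions unchanged
    return total / len(numerators)

good = mean_of_ratios([1.0, 2.0], [2.0, 4.0])
overflowed = mean_of_ratios([1.0, 2.0], [0.0, 4.0])
invalid = mean_of_ratios([0.0, 2.0], [0.0, 4.0])

# One check at the very end is enough to detect that something went wrong.
for name, result in [("good", good), ("overflowed", overflowed), ("invalid", invalid)]:
    print(name, result, "ok" if math.isfinite(result) else "error detected")
```

The error check sits outside the loop, so the inner loop stays free of branches and traps.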

The NAN code contains a payload of additional bits that can carry information about the kind of error that generated the NAN. The payload can be very useful for error codes from mathematical function libraries. The IEEE 754 standard for floating point representation is incomplete with respect to the propagation of NAN payloads. I have discussed these problems with the working group behind the standard, but they do not want to make any modifications in the forthcoming revision because NAN payloads are rarely used today and it is difficult to predict future needs (http://754r.ucbtest.org/background/nan-propagation.pdf). The missing details can easily be specified for ForwardCom to make the propagation of INF and NAN reliable.
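To illustrate what a payload looks like in practice, here is a small sketch that packs an error code into the fraction bits of a double precision quiet NAN and reads it back. The helper names and the error code are hypothetical, and the layout assumed is the usual binary64 one (11 exponent bits all ones, top fraction bit as the quiet bit, 51 bits left for the payload).

```python
import math
import struct

QUIET_BIT = 1 << 51            # most significant fraction bit of a binary64
PAYLOAD_MASK = QUIET_BIT - 1   # the remaining 51 fraction bits
EXP_ALL_ONES = 0x7FF << 52     # exponent field of INF and NAN

def make_nan(payload):
    """Build a quiet NAN carrying `payload` in its low 51 fraction bits."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = EXP_ALL_ONES | QUIET_BIT | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def nan_payload(x):
    """Return the payload of a NAN, or None if x is not a NAN."""
    if not math.isnan(x):
        return None
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK

ERR_LOG_OF_NEGATIVE = 0x123    # hypothetical library error code

err = make_nan(ERR_LOG_OF_NEGATIVE)
after_arithmetic = err * 2.0 + 1.0

print(hex(nan_payload(err)))
# Whether the payload is still intact here depends on the platform:
print(nan_payload(after_arithmetic))
```

The result of the arithmetic is certainly a NAN, but whether its payload still equals the original error code is exactly the kind of detail that is left open. The sketch therefore only prints it and does not rely on it.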

I have written a paper describing the details of NAN propagation, including recommendations on how to use it and which compiler optimization options to use. You can find the paper here: agner.org/optimize/nan_propagation.pdf

I wonder if we need fault trapping at all in ForwardCom when NAN propagation is the preferred way of detecting floating point errors anyway. ForwardCom has options for trapping integer overflow as well, but most current microprocessor systems have no such options and it is probably better to rely on special instructions for overflow detection instead.
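A rough sketch of the "special instruction" idea for integers: instead of trapping, the operation returns the wrapped result together with an overflow flag that the program can test when it cares. This only models the concept for 32-bit signed addition; the function name is made up and this is not the semantics of any actual ForwardCom instruction.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def add_int32_checked(a, b):
    """Return (wrapped_sum, overflow_flag) for two 32-bit signed integers."""
    exact = a + b  # Python integers do not overflow, so this is the true sum
    overflow = not (INT32_MIN <= exact <= INT32_MAX)
    # Reduce modulo 2**32 into the signed range, as two's complement hardware would.
    wrapped = (exact - INT32_MIN) % 2**32 + INT32_MIN
    return wrapped, overflow

total, overflowed = add_int32_checked(INT32_MAX, 1)
if overflowed:
    print("overflow detected, wrapped result is", total)
```

The flag plays the same role for integers that INF and NAN play for floating point: the error is recorded in the data flow and no trap is needed.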

A superscalar (out-of-order) processor will have less need for speculative execution if there is no fault trapping. See the thread "possible execution pipeline" at http://www.forwardcom.info/forum/viewtopic.php?f=1&t=78

We still need traps (software interrupts) for detecting illegal instructions and for memory access violations. Illegal instructions can be detected in the in-order front end. Memory addresses can also be calculated in the in-order front end if we follow the proposal for "control flow decoupling" in chapter 8.1 of the ForwardCom manual. Branching will also be in the in-order front end.

So I am wondering, is it possible to make a superscalar processor with no speculative execution at all? We can have speculative fetch, decode, and address calculation in connection with branch prediction, but stop speculating before the execute stage in the pipeline and before the out-of-order back end.

We still need hardware interrupts for servicing external hardware and for task switching. The interrupt can wait for the pipeline to be flushed if response time is not critical, or we could have one or more CPU cores reserved for servicing external hardware.

Post by **HubertLamontagne** » 2018-05-24, 21:28:55

agner wrote: ↑2018-05-24, 11:07:49 So I am wondering, is it possible to make a superscalar processor with no speculative execution at all? We can have speculative fetch, decode, and address calculation in connection with branch prediction, but stop speculating before the execute stage in the pipeline and before the out-of-order back end.

It can be done but it's not for the faint of heart. Some approaches:

Shipped hardware:
ARM Cortex A8 : No speculation, in-order dual issue, unusually long pipeline with instruction queue to FPU/SIMD unit and write buffer
https://www.design-reuse.com/articles/1 ... essor.html
- Classic in-order design
- More aggressive operand forwarding makes it possible to have more pipeline stages before writeback
- Long latency instructions such as multiplications are possible. You get a stall if you read the results early
- FPU/SIMD unit runs after the writeback stage, so it cannot generate any interrupts or conditionals
- Delay between FPU/SIMD instructions vs regular instructions somewhat flexible
- Data written by the FPU/SIMD unit goes to a write buffer, with write addresses locked earlier in the pipeline.

Itanium : Pure software speculation, using explicit speculative load + advanced load + check instructions
https://blogs.msdn.microsoft.com/oldnew ... /?p=91181/
https://blogs.msdn.microsoft.com/oldnew ... /?p=91171/
- Every register has a NaT bit. Faulty speculative loads don't fault but set the NaT bit instead. A later check instruction triggers a software retry if something went wrong.
- Advanced loads use the ALAT (Advanced Load Address Table). The check triggers a software retry if an intervening store aliased the load.
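The NaT mechanism is close in spirit to NAN propagation, which a toy model can show. Everything below (the `Reg` class, the memory contents, the recovery value) is invented for illustration and is not Itanium's actual instruction semantics: a speculative load from an unmapped address marks the register instead of faulting, arithmetic propagates the mark, and a check at the end decides whether to run recovery code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # True means the value is invalid ("Not a Thing")

MEMORY = {0x100: 7, 0x104: 35}  # hypothetical mapped addresses

def load_speculative(addr):
    """Deferred-fault load: an unmapped address sets NaT instead of faulting."""
    if addr in MEMORY:
        return Reg(MEMORY[addr])
    return Reg(0, nat=True)

def add(a, b):
    """Arithmetic propagates NaT from either source operand."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)

def check(reg, recovery):
    """Run the recovery path only if the speculation went wrong."""
    return recovery() if reg.nat else reg

def sum_two(addr_a, addr_b):
    total = add(load_speculative(addr_a), load_speculative(addr_b))
    # Hypothetical recovery: report failure with a sentinel value.
    return check(total, recovery=lambda: Reg(-1))

ok = sum_two(0x100, 0x104)
bad = sum_two(0x100, 0x999)
print(ok, bad)
```

In real code the recovery path would typically redo the loads non-speculatively, so that a genuine fault is raised at the right place.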

Transmeta Crusoe : Software speculation with hardware assisted commit/rollback, uses load-lock + store check + commit
https://pdfs.semanticscholar.org/presen ... 4db109.pdf
- Has 2 versions of the reg file, a temporary + a permanent
- Has a write buffer
- If there's an interrupt or any load instruction aliases with a store, the reg file is restored to the permanent version and the writes in the buffer are discarded
- If execution reaches a commit, the temporary registers are copied into the permanent ones and the write buffer contents are written to memory
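The commit/rollback scheme in the list above can be sketched as a small state machine. This is a simplified model with invented names, not a description of the real Crusoe hardware: registers are kept in a working copy and a committed copy, stores are held in a buffer, and either everything since the last commit becomes permanent or all of it is thrown away.

```python
class ShadowedMachine:
    """Working registers plus a store buffer, with commit and rollback."""

    def __init__(self):
        self.committed_regs = {}  # the "permanent" register file
        self.working_regs = {}    # the "temporary" register file
        self.memory = {}
        self.store_buffer = []    # pending (address, value) writes

    def set_reg(self, name, value):
        self.working_regs[name] = value

    def store(self, addr, value):
        # Stores are only buffered; memory is untouched until commit.
        self.store_buffer.append((addr, value))

    def commit(self):
        self.committed_regs = dict(self.working_regs)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self):
        self.working_regs = dict(self.committed_regs)
        self.store_buffer.clear()

m = ShadowedMachine()
m.set_reg("r1", 10)
m.store(0x200, 10)
m.commit()            # the first region becomes permanent

m.set_reg("r1", 99)
m.store(0x200, 99)
m.rollback()          # e.g. an interrupt arrived: discard the second region

print(m.working_regs, m.memory)
```

After the rollback the machine state is exactly what it was at the last commit, so the region can be re-executed, possibly in a more conservative way.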

NVidia Denver: Supposedly similar to the Crusoe. There's very little information available about this one.

Research:
Idempotent Regions:
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
- Similar to Crusoe, but the idempotent regions replace explicit commit at end of regions

Decoupled Access Execute:
https://pdfs.semanticscholar.org/27c2/0 ... b18b1a.pdf
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
- Similar to ARM Cortex A8
- Has the most important part of OOO, the tolerance for memory load latency
- Code must be split into 2 parts, the part that generates memory accesses, and the part that uses loaded values and does calculations on them
- How to properly do speculation on this type of architecture is still an open problem
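The split into two parts can be sketched as two functions connected by a queue. This is only a schematic with made-up names and data: in hardware the access part would run ahead of the execute part concurrently, while here they simply run one after the other.

```python
from collections import deque

def access_stream(memory, addresses, load_queue):
    """Access part: issues loads and pushes the values, no arithmetic on them."""
    for addr in addresses:
        load_queue.append(memory[addr])

def execute_stream(load_queue, count):
    """Execute part: consumes loaded values in order and does the calculation."""
    total = 0
    for _ in range(count):
        total += load_queue.popleft()
    return total

memory = [3, 1, 4, 1, 5, 9, 2, 6]
addresses = [0, 2, 4, 6]

queue = deque()
access_stream(memory, addresses, queue)
result = execute_stream(queue, len(addresses))
print(result)
```

The queue is what gives the tolerance for load latency: as long as the access part stays ahead, the execute part never waits for memory.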

So, yeah, it's totally possible. The real challenge IMHO is that the parts you have to add to reduce the number of cases where you get hard stalls tend to increase complexity, and if complexity increases too much, you end up with something that's more complex than if you had just gone with OOO in the first place (see Itanium for an example of this).

Post by **csdt** » 2018-05-25, 9:09:58

Hi Agner,

I really think that NaN propagation is much better than fault trapping.

However, I fail to see how avoiding fault trapping would remove the need for speculative execution...
I see why it reduces the complexity, but I have the feeling speculative execution is still needed: traps are not the only mechanism triggering speculative execution.

agner wrote: ↑2018-05-24, 11:07:49 We still need hardware interrupts for servicing external hardware and for task switching. The interrupt can wait for the pipeline to be flushed if response time is not critical, or we could have one or more CPU cores reserved for servicing external hardware.

I would not recommend dedicating some cores to hardware handling: people will try to use them to unleash the full power of the CPU...
Are there any cases where interrupt response time is so critical that an extra 10-20 cycles cannot be afforded?

Post by **agner** » 2020-02-04, 6:38:51

This discussion is continued in a new thread. Different ways of detecting floating point exceptions and errors are discussed here: viewtopic.php?f=1&t=124
