More efficient ways of detecting exceptions

discussion of forwardcom instruction set and corresponding hardware and software

More efficient ways of detecting exceptions

Post by agner »

Floating point errors are traditionally detected in two ways: with a global status register or by traps (software interrupts). Both methods are problematic with out-of-order execution and vector processing (SIMD) for the following reasons.

A global status register has to be updated after every floating point operation. If multiple instructions are executed simultaneously or out of order, they all have to modify the status register. They may have to do so in order. Reading the status register is a serializing event: all preceding instructions have to retire before the status register can be read. The status register does not tell which instruction caused an exception. A vector instruction may generate multiple exceptions but set the status register only once.

Exception trapping is even more inefficient because exceptions must happen in order. All instructions must be executed speculatively so that they can be rolled back if a preceding instruction that has not yet finished causes an exception. A single instruction may be delayed for hundreds of clock cycles in case of a cache miss. The out-of-order scheduler may find many subsequent independent instructions that can be executed in the meantime, and all of them must execute speculatively. This is an awful lot of bookkeeping.

A more efficient solution would be to propagate status information through a chain of calculations to the end result, as discussed in a previous thread viewtopic.php?f=1&t=91 and in the document https://www.agner.org/optimize/nan_propagation.pdf.
A propagation method would save a lot of silicon and power.

I will discuss three possible ways of propagating error status:

Method 1.
Floating point overflow is propagated as INF. Invalid operations are propagated as NAN. The error is detected in the end result.
Advantages:
  • This works with existing systems. Nothing new has to be introduced
  • The result of each element of a vector is reported separately. Scalar code can be vectorized without changing the result.
Disadvantages:
  • INF does not propagate through division: 1/INF = 0.
  • Underflow and inexact exceptions cannot be detected. These exceptions are rarely used, but they are required by the IEEE-754 floating point standard
  • Overflow in a float-to-int conversion cannot be detected with this method
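
The first disadvantage of method 1 can be seen with ordinary IEEE-754 doubles. The following is only a small sketch using Python floats as a stand-in for hardware arithmetic. The variable names are made up for illustration.

```python
import math

# Assumption: Python floats behave as ordinary IEEE-754 doubles.
big = 1e308
overflowed = big * 10.0              # overflow gives INF

# INF survives addition and multiplication...
still_inf = overflowed * 2.0 + 1.0

# ...but division by INF hides it: the result is an ordinary zero.
lost = 1.0 / overflowed

# An invalid operation gives NAN, which propagates through later arithmetic.
invalid = overflowed - overflowed    # INF - INF
propagated = 1.0 / (invalid + 1.0)
```

Checking only the end result would catch `propagated` but not `lost`.
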
Method 2.
Certain bits in a control register or mask register indicate what exceptions you want to detect. An enabled exception will generate a NAN result with a payload indicating the kind of exception and where it occurred.
Advantages:
  • Same advantages as method 1.
  • NANs can be detected with existing methods, including standard compare instructions
  • Underflow and inexact exceptions can be detected
  • Overflow generates NAN rather than INF to make sure it propagates through division
  • It is possible to detect where the error occurred. This is useful for debugging
Disadvantages:
  • Legacy code that relies on overflow generating INF may fail
  • Overflow in a float-to-int conversion cannot be detected with this method
Method 3.
All floating point/vector registers should have some extra status bits that are set in case of exceptions. The status bits are propagated through a series of calculations in the following way. An operation like C = A + B will set the status bits of C as the OR combination of the status bits of A, the status bits of B, and the status resulting from the + operation. ForwardCom has special instructions for saving a variable-length vector register in a system-dependent compressed format. These instructions can include the status bits.
Advantages:
  • Works for integer overflow as well.
Disadvantages:
  • All vector registers must have extra bits
  • Vector registers can contain elements of 1, 2, 4, 8, or 16 bytes. Do we want status bits for all possible element sizes?
  • The status bits are lost when saving values in standard form
  • The status bits are difficult to access from high level language code
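
The OR rule of method 3 might be modelled roughly as below. This is only a software sketch. The flag values, the `Tagged` class and the `combine` helper are hypothetical, and division by zero is not modelled.

```python
import math
import operator
from dataclasses import dataclass

# Hypothetical flag values, for illustration only.
OVERFLOW = 1
INVALID = 2

@dataclass
class Tagged:
    value: float
    status: int = 0      # sticky exception bits that travel with the value

def combine(op, a: Tagged, b: Tagged) -> Tagged:
    """Apply op and OR together both input statuses and the new status."""
    result = op(a.value, b.value)
    inputs = (a.value, b.value)
    new = 0
    if math.isnan(result) and not any(math.isnan(v) for v in inputs):
        new |= INVALID
    if math.isinf(result) and not any(math.isinf(v) for v in inputs):
        new |= OVERFLOW
    return Tagged(result, a.status | b.status | new)

y = combine(operator.mul, Tagged(1e308), Tagged(10.0))   # overflows to INF
z = combine(operator.truediv, Tagged(1.0), y)            # 1/INF = 0
```

Here `z.value` is an ordinary zero, but `z.status` still carries the overflow bit, which is the point of the method.
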
Any of these methods will make the hardware implementation much simpler and more efficient. There are certain problems, though, when using this with high-level language code. We need to discuss these problems.

The IEEE-754 floating point standard makes a distinction between immediate and delayed exception handling. The methods described here are perfect for delayed exception handling. You can simply check the result after a chain of calculations. The situation is more difficult if you want immediate exception handling. Immediate exception handling means that, in principle, you have to stop the series of calculations immediately in case of an exception. A high-level language may detect exceptions either by reading a status register or with a try/catch block. The status flag is the most common method of detecting floating point errors in C/C++, but try/catch is possible at least in some cases. Other languages like Java and C# are unable to raise and catch floating point exceptions, AFAIK.

Checking the end result with any of the above methods will work as a useful replacement for a status register. The try/catch method is more difficult because it presupposes immediate exception handling. We may think of different scenarios with try/catch blocks:
  • The 'catch' block aborts the program with an error message. This situation is easy. All data are lost anyway, so it does not matter at what time the exception is detected.
  • The 'catch' block tries to recover from the error. The code assumes that all calculations before the exception are correct. We must roll back any calculations done after the point of the exception. Vectorizing a loop in this situation can be complicated, but in simple cases we may simply save the part of the result vector that precedes the element that indicates an error.
  • The 'catch' block tries to fix the error. The 'catch' block may access intermediate variables, including the value of a loop counter at the time of the exception. It may be very difficult to vectorize such a loop. The code may restore data to the state before the 'try' block and redo all the calculations without vectors.
Current compilers are unable to vectorize a floating point loop when exception trapping is on. Hence, you cannot vectorize a loop inside a try/catch block in current systems. You may be able to do so with the methods discussed here, but it will be difficult.
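
The simple recovery case, where only the part of the result vector before the first error is kept, could look roughly like this. It is a sketch only. A Python list stands in for a vector register, and `sqrt_or_nan` and `process_chunk` are hypothetical helpers that model an operation returning NAN instead of trapping.

```python
import math

def sqrt_or_nan(x: float) -> float:
    """Model of a non-trapping operation: invalid input gives NAN."""
    return math.sqrt(x) if x >= 0.0 else math.nan

def process_chunk(xs):
    """Compute the whole chunk, then keep only the results before the first NAN.

    Returns (valid_prefix, error_index); error_index is None if no error.
    """
    results = [sqrt_or_nan(x) for x in xs]      # all elements "in parallel"
    for i, r in enumerate(results):
        if math.isnan(r):
            return results[:i], i               # discard everything from the error on
    return results, None

prefix, failed_at = process_chunk([4.0, 9.0, -1.0, 16.0])
```

The index `failed_at` plays the role of the loop counter that a 'catch' block would see in the scalar version of the loop.
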

I would like to hear your opinions on which method of error detection to prefer for ForwardCom and any problems it may involve. Can we avoid speculative execution completely if traps are replaced by error propagation, and hardware interrupts are handled in an in-order front end?

Re: More efficient ways of detecting exceptions

Post by agner »

I have finally decided on method 2 mentioned above. This is now fully implemented in version 1.09 of the emulator and documented in the manual. This method makes sure that the detection of an exception is tied to the output of the specific instruction.

The result of an exception is either the default value or a NAN:

Floating point exception results

Event              Exception disabled   Exception enabled
Division by zero   INF                  NAN
Overflow           INF                  NAN
Underflow          Subnormal or 0       NAN
Inexact            Rounded              NAN

Exceptions can be enabled in the floating point control register or in a mask register.

The NAN contains a payload value where the lower 8 bits indicate the type of exception and the remaining bits indicate the code address where the exception happened.

Some modifications have been made to make sure the NANs are propagating optimally. When two NANs are combined, e.g. NAN1 + NAN2, the one with the highest payload is propagated. The code address embedded in the NAN payload has all bits inverted so that lower addresses have preference. This makes it possible for a debugger to find the first exception in a sequence of code. The debugger may have to check for NANs before any backward jump to make sure it finds the first exception. The software may have to check for NANs before any instruction that does not propagate NANs, such as compare instructions and conversion to integer.
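
The effect of inverting the address bits can be illustrated with a small model working on raw bit patterns. The layout is an assumption for illustration and is not taken from the manual: a double precision quiet NAN with a 51-bit payload, the exception type in the lower 8 bits, and the inverted code address in the remaining 43 bits. The exception code `EXC_OVERFLOW` is hypothetical.

```python
import math
import struct

# Assumed layout, for illustration only (see the text above).
PAYLOAD_BITS = 51
TYPE_BITS = 8
ADDR_MASK = (1 << (PAYLOAD_BITS - TYPE_BITS)) - 1
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1
QUIET_NAN = 0x7FF8_0000_0000_0000      # exponent all ones plus quiet bit

def make_nan(exc_type: int, address: int) -> int:
    """Build the bit pattern of a NAN carrying type and inverted address."""
    inverted = ~address & ADDR_MASK
    return QUIET_NAN | (inverted << TYPE_BITS) | (exc_type & 0xFF)

def decode(nan_bits: int):
    """Return (exception type, code address) from a NAN bit pattern."""
    payload = nan_bits & PAYLOAD_MASK
    return payload & 0xFF, ~(payload >> TYPE_BITS) & ADDR_MASK

def combine(a_bits: int, b_bits: int) -> int:
    """Propagate the NAN with the highest payload."""
    return a_bits if (a_bits & PAYLOAD_MASK) >= (b_bits & PAYLOAD_MASK) else b_bits

def as_float(bits: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

EXC_OVERFLOW = 2                       # hypothetical exception code
early = make_nan(EXC_OVERFLOW, 0x1000)
late = make_nan(EXC_OVERFLOW, 0x2000)
winner = combine(late, early)
```

Because the address bits are inverted, the NAN from the lower address 0x1000 has the higher payload and wins regardless of operand order.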

Conversion of a floating point number to a lower precision preserves the lower bits of a NAN payload, rather than the higher bits. Current CPUs with binary floating point representation preserve the higher bits of a NAN payload. This behavior is undocumented and not standardized, so I think it is perfectly acceptable to change it.
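
The difference between the two narrowing rules can be sketched on bit patterns as follows. The sketch assumes the usual IEEE-754 field widths (51 payload bits in a double precision quiet NAN, 22 in single precision) and ignores the sign bit. The function names are hypothetical.

```python
DOUBLE_PAYLOAD_BITS = 51
SINGLE_PAYLOAD_BITS = 22
DOUBLE_PAYLOAD_MASK = (1 << DOUBLE_PAYLOAD_BITS) - 1
SINGLE_PAYLOAD_MASK = (1 << SINGLE_PAYLOAD_BITS) - 1
SINGLE_QUIET_NAN = 0x7FC0_0000

def narrow_keep_low(double_bits: int) -> int:
    """Rule described above: keep the lower payload bits."""
    payload = double_bits & DOUBLE_PAYLOAD_MASK
    return SINGLE_QUIET_NAN | (payload & SINGLE_PAYLOAD_MASK)

def narrow_keep_high(double_bits: int) -> int:
    """Mantissa truncation: keep the upper payload bits."""
    payload = double_bits & DOUBLE_PAYLOAD_MASK
    shift = DOUBLE_PAYLOAD_BITS - SINGLE_PAYLOAD_BITS
    return SINGLE_QUIET_NAN | (payload >> shift)

# A double precision NAN whose payload holds only a small code in the low bits.
double_nan = 0x7FF8_0000_0000_0000 | 0x05
low = narrow_keep_low(double_nan)
high = narrow_keep_high(double_nan)
```

With the lower bits preserved, the code 0x05 survives the conversion. With the upper bits preserved, it is shifted out and the payload becomes zero.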

Conversion of floating point values to integers cannot generate NANs, of course. Instead, I have added various options to the float2int instruction for deciding what to do in case of overflow or NAN input. The software can detect overflow and NANs here by executing the float2int instruction twice with different options and comparing the results.
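
The idea of converting twice and comparing can be sketched like this. The `float2int` function below is a hypothetical software model with made-up option names, a 32-bit result and truncation as rounding mode. It is not the actual instruction encoding.

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def float2int(x: float, fallback: str) -> int:
    """Hypothetical model: truncate x to a 32-bit integer.

    fallback selects the result for overflow or NAN input:
    "zero" gives 0, "min" gives INT32_MIN.
    """
    if math.isfinite(x):
        truncated = math.trunc(x)
        if INT32_MIN <= truncated <= INT32_MAX:
            return truncated
    if fallback == "zero":
        return 0
    if fallback == "min":
        return INT32_MIN
    raise ValueError("unknown fallback option")

def conversion_failed(x: float) -> bool:
    """Convert twice with different fallbacks; a mismatch means overflow or NAN."""
    return float2int(x, "zero") != float2int(x, "min")
```

A valid input gives the same result with both options, so the comparison only differs when the fallback value was used.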

Integer overflow is detected with the instructions add_oc, sub_oc, mul_oc, div_oc. These instructions work on vector registers and use each odd-numbered vector element to indicate and propagate information about overflow in the preceding even-numbered vector element.
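
The even/odd element scheme might be modelled roughly as below. This is only a sketch under assumptions: 32-bit signed elements, a Python list standing in for a vector register, and a single 0/1 flag in each odd-numbered element. The real instructions may encode the overflow information differently.

```python
def wrap32(x: int) -> int:
    """Wrap an integer to the signed 32-bit range."""
    return (x + 2**31) % 2**32 - 2**31

def add_oc(a, b):
    """Model of add with overflow check on [value, flag, value, flag, ...]."""
    out = []
    for i in range(0, len(a), 2):
        exact = a[i] + b[i]
        wrapped = wrap32(exact)
        # The flag is sticky: OR of both input flags and any new overflow.
        flag = a[i + 1] | b[i + 1] | int(wrapped != exact)
        out += [wrapped, flag]
    return out

first = add_oc([2**31 - 1, 0, 5, 0], [1, 0, 6, 0])   # element 0 overflows
second = add_oc(first, [1, 0, 1, 0])                 # flag stays set
```

After the second addition, the flag next to the first value is still set although that addition did not overflow by itself.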