Nonlocal control flow

Post by **Moonchild** » 2022-01-18, 21:21:52

How should nonlocal control flow (such as exceptions) be implemented on ForwardCom? The manual refers to read_call_stack and write_call_stack, but those are privileged, and in any case they are heavy-handed if all you want is to unwind the call stack a little while retaining most of its entries.

Post by **agner** » 2022-01-19, 7:51:14

You are right that stack unwinding requires privileged instructions.

Stack unwinding takes place in the following cases:

Exception trapping (try/catch) in object oriented languages
Debugging
longjmp in C

Exception trapping and debugging require privileged access anyway. A program that has a large number of taken exceptions will be inefficient. This should be avoided.

longjmp is mostly considered bad programming practice that should be avoided. Some coroutine libraries use longjmp. It may be more efficient to use multithreading instead.
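For reference, the longjmp pattern under discussion can be sketched in standard C roughly as follows. The names `search_exit`, `scan`, and `first_negative` are made up for this illustration; only `setjmp` and `longjmp` come from the C library.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical example: abandon a nested search as soon as a match is found. */
static jmp_buf search_exit;   /* saved context of the outermost caller */
static int found_value;       /* result carried across the nonlocal jump */

/* Recursively scan the array; jump straight out on the first negative entry. */
static void scan(const int *data, size_t n) {
    if (n == 0)
        return;
    if (data[0] < 0) {
        found_value = data[0];
        longjmp(search_exit, 1);   /* skips every pending return of scan() */
    }
    scan(data + 1, n - 1);
}

/* Returns 1 and stores the first negative element in *out, or returns 0. */
int first_negative(const int *data, size_t n, int *out) {
    if (setjmp(search_exit) != 0) {   /* execution resumes here after longjmp */
        *out = found_value;
        return 1;
    }
    scan(data, n);
    return 0;                         /* scan returned normally: nothing found */
}
```

The single `longjmp` discards all intermediate `scan` frames at once, which is exactly the kind of stack unwinding that the privileged call stack instructions would have to support.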

Post by **Moonchild** » 2022-01-19, 8:36:20

Nonlocal return is commonly used for ordinary control flow in, e.g., Common Lisp; this restriction prohibits efficient CL implementations on ForwardCom.

Exceptions and similar mechanisms are not as slow in garbage-collected languages as in, e.g., C++, because you do not need to do as much irrelevant bookkeeping to clean up stack-bound heap-allocated objects. The final cost should be little more than a branch mispredict.

Multithreading is not a viable alternative to coroutines; context switching is expensive and synchronization is expensive. I agree many current applications of coroutines might be better replaced by threads, but coroutines are still a useful programming feature that should not be discarded.

Generalising from coroutines, lack of cheap access to the call stack inhibits efficient implementation of continuations. I think the only practical solution given the current ISA is a full CPS (continuation-passing style) transform, which is likely to be even more expensive due to mispredicts on indirect calls.
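To make the CPS concern concrete, here is a minimal sketch in C of a search written in continuation-passing style. All names (`Continuation`, `find_negative_cps`, `store_result`, `report_missing`, `first_negative_cps`) are invented for the example, and C is only a stand-in for what a compiler would generate.

```c
#include <stddef.h>

/* A continuation: "what to do next", plus an opaque environment pointer. */
typedef int (*Continuation)(void *env, int value);

/* CPS search: the function never produces a result itself; it hands control
   to one of two continuations through an indirect call. */
static int find_negative_cps(const int *data, size_t n,
                             Continuation on_found,
                             Continuation on_missing, void *env) {
    if (n == 0)
        return on_missing(env, 0);        /* ordinary exit path */
    if (data[0] < 0)
        return on_found(env, data[0]);    /* early exit is just another call */
    return find_negative_cps(data + 1, n - 1, on_found, on_missing, env);
}

/* Continuation for the "found" case: store the value in *env. */
static int store_result(void *env, int value) {
    *(int *)env = value;
    return 1;
}

/* Continuation for the "not found" case. */
static int report_missing(void *env, int value) {
    (void)env;
    (void)value;
    return 0;
}

/* Returns 1 and stores the first negative element in *out, or returns 0. */
int first_negative_cps(const int *data, size_t n, int *out) {
    return find_negative_cps(data, n, store_result, report_missing, out);
}
```

Every exit goes through a function pointer, so the cost shifts from return prediction to indirect-call prediction. Note that C does not guarantee tail-call elimination, so this sketch still consumes stack on each recursive step; a real CPS compiler would not.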

Post by **agner** » 2022-01-19, 11:09:59

I think that nonlocal returns may be implemented more efficiently as a sequence of normal returns. It is possible to have multiple data stacks on ForwardCom. The compiler could use a register or an extra stack or a particular stack space to keep track of the nesting level.
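One possible reading of this suggestion, sketched in C: a global variable records which nesting level should catch a pending nonlocal return, and every caller checks it after each call. The names `unwind_target`, `descend`, and `run_unwind` are hypothetical, and the global stands in for the register or extra stack mentioned above; the sketch assumes `target` is smaller than `max_level`.

```c
#define NO_UNWIND (-1)

/* Assumed bookkeeping: the nesting level that should catch the pending
   nonlocal return, or NO_UNWIND when none is in flight. */
static int unwind_target = NO_UNWIND;
static int frames_unwound = 0;        /* instrumentation for the example only */
static int caught_at = NO_UNWIND;     /* level that finally handled the unwind */

static void descend(int level, int max_level, int target) {
    if (level == max_level) {
        unwind_target = target;       /* request a return to frame `target` */
        frames_unwound++;             /* this frame returns as part of the unwind */
        return;
    }
    descend(level + 1, max_level, target);
    if (unwind_target != NO_UNWIND) { /* the branch needed after every call */
        if (unwind_target < level) {
            frames_unwound++;         /* not our unwind: keep returning */
            return;
        }
        caught_at = level;            /* this frame is the target: stop here */
        unwind_target = NO_UNWIND;
    }
    /* normal continuation of this frame would go here */
}

/* Runs the example; returns the level that caught the unwind and stores the
   number of frames that returned early in *unwound. */
int run_unwind(int max_level, int target, int *unwound) {
    unwind_target = NO_UNWIND;
    frames_unwound = 0;
    caught_at = NO_UNWIND;
    descend(0, max_level, target);
    *unwound = frames_unwound;
    return caught_at;
}
```

With `max_level` 5 and `target` 2, frames 5, 4, and 3 return early and frame 2 resumes normally. Every return is an ordinary one, at the price of one extra check per call site.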

Error handling can also be implemented without exception traps. My favorite method is to let the function that detects an error first call a global error function and then return an error code to tell the caller to ignore the result. (See the chapter "Error message handling" in the forwardcom manual).
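A minimal sketch of this pattern in C, assuming a hypothetical global error function `report_error` (the actual interface recommended by the manual may differ) and invented helpers `checked_divide` and `divide_or_default`:

```c
#include <stdio.h>

static int errors_reported = 0;

/* Hypothetical global error function: one central place that decides how an
   error is presented (console, GUI, log file, ...). */
void report_error(const char *message) {
    errors_reported++;
    fprintf(stderr, "error: %s\n", message);
}

/* Returns 0 on success. On failure the error is reported first, then a
   nonzero code tells the caller to ignore *quotient. */
int checked_divide(int numerator, int denominator, int *quotient) {
    if (denominator == 0) {
        report_error("division by zero");
        return 1;
    }
    *quotient = numerator / denominator;
    return 0;
}

/* A caller that honours the error code and falls back to a default. */
int divide_or_default(int numerator, int denominator, int fallback) {
    int quotient;
    if (checked_divide(numerator, denominator, &quotient) != 0)
        return fallback;              /* result is invalid: do not use it */
    return quotient;
}

/* Number of errors reported so far. */
int error_count(void) {
    return errors_reported;
}
```

No stack unwinding is involved: the failing function returns normally and the caller decides what to do with the error code.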

I don't understand why coroutines are needed. I would use other methods such as a class object that can be accessed whenever a particular task needs to be updated. Can you give an example where coroutines are necessary?

Post by **Moonchild** » 2022-01-19, 19:16:26

agner wrote: ↑2022-01-19, 11:09:59 I think that nonlocal returns may be implemented more efficiently as a sequence of normal returns.

1. Why? Why is it more efficient to return multiple times than just once?

2. How would that work without a branch after every call?

agner wrote: ↑2022-01-19, 11:09:59 Error handling can also be implemented without exception traps. My favorite method is to let the function that detects an error first call a global error function and then return an error code to tell the caller to ignore the result. (See the chapter "Error message handling" in the forwardcom manual).

I don't agree, but that's beside the point. I will just ask: is it a goal of forwardcom to support efficient implementation of languages where this is not the norm? If the answer is no, I will shut up.

agner wrote: ↑2022-01-19, 11:09:59 Can you give an example where coroutines are necessary?

It is useful for emulators, simulating components which run in parallel. This article discusses it in more detail.

Systems such as Erlang rely on extremely efficient context switching between 'threads', as they encourage a parallel programming style that leads to large numbers of these threads. I do not believe hardware context switches can be faster than software ones, even if they are more efficient than on traditional architectures.

Post by **agner** » 2022-01-20, 7:37:28

Thank you for trying to find weak points in my system.

agner wrote: ↑2022-01-19, 11:09:59 I think that nonlocal returns may be implemented more efficiently as a sequence of normal returns.
Moonchild wrote: ↑2022-01-19, 19:16:26 1. Why? Why is it more efficient to return multiple times than just once?

Because it doesn't mess up the return prediction. A non-local return will cause not only this return, but all pending returns to be mispredicted on CPUs that have a return prediction mechanism, which all high-end processors have today.
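The effect can be illustrated with a toy model in C. This is a deliberately simplified return predictor (a plain stack that is never repaired), not a description of any real CPU; `ReturnModel`, `count_mispredicts`, and the return addresses are invented for the sketch.

```c
#define MAX_DEPTH 64

/* Toy model: the real call stack and a predictor stack that mirrors calls. */
typedef struct {
    int call_stack[MAX_DEPTH];
    int call_top;
    int predictor[MAX_DEPTH];
    int predictor_top;
    int mispredicts;
} ReturnModel;

/* A call pushes the return address on both stacks. */
static void model_call(ReturnModel *m, int return_address) {
    m->call_stack[m->call_top++] = return_address;
    m->predictor[m->predictor_top++] = return_address;
}

/* A return pops both stacks and compares prediction with reality. */
static void model_return(ReturnModel *m) {
    int actual = m->call_stack[--m->call_top];
    int predicted = m->predictor[--m->predictor_top];
    if (predicted != actual)
        m->mispredicts++;
}

/* A nonlocal exit drops frames from the real stack only; the predictor is
   assumed not to be informed. */
static void model_nonlocal_exit(ReturnModel *m, int frames) {
    m->call_top -= frames;
}

/* Nest `depth` calls, leave `frames_skipped` of them either by normal returns
   or by one nonlocal exit, then return from the rest. Returns the number of
   mispredicted returns, or -1 for invalid arguments. */
int count_mispredicts(int depth, int frames_skipped, int use_normal_returns) {
    if (depth < 0 || depth > MAX_DEPTH ||
        frames_skipped < 0 || frames_skipped > depth)
        return -1;
    ReturnModel m = {0};
    for (int i = 0; i < depth; i++)
        model_call(&m, 100 + i);      /* distinct made-up return addresses */
    if (use_normal_returns) {
        for (int i = 0; i < frames_skipped; i++)
            model_return(&m);
    } else {
        model_nonlocal_exit(&m, frames_skipped);
    }
    while (m.call_top > 0)
        model_return(&m);
    return m.mispredicts;
}
```

In this model, six nested calls followed by a nonlocal exit over three frames leave the predictor three entries out of step, so each of the three pending returns is mispredicted, while the same exit performed as normal returns mispredicts none.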

Moonchild wrote: ↑2022-01-19, 19:16:26 2. How would that work without a branch after every call?

Yes, you need a branch to check if the return value is valid. A good branch predictor may be able to predict this.

Moonchild wrote: ↑2022-01-19, 19:16:26 I will just ask: is it a goal of ForwardCom to support efficient implementation of languages where this is not the norm?

The chapter "Error message handling" in the ForwardCom manual recommends a standardized way of handling errors in order to improve compatibility between different user interface paradigms, including command line programs, GUI programs, and server mode programs. Some programming languages may have their own standards for error messages, which would override the ForwardCom recommendation.

agner wrote: ↑2022-01-19, 11:09:59 Can you give an example where coroutines are necessary?
Moonchild wrote: ↑2022-01-19, 19:16:26 It is useful for emulators, simulating components which run in parallel.

Why not make each component an object in an object oriented language? The whole context of a component is stored in the object. There is no cost of context switching - you just call a member function (method) of each object.
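A rough sketch of this approach, written in C with structs standing in for objects. The components (`Cpu`, `Timer`) and their behaviour are invented for illustration and do not model any real machine.

```c
/* Each emulated component keeps its whole context in a struct. */
typedef enum { PHASE_FETCH, PHASE_EXECUTE } CpuPhase;

typedef struct {
    CpuPhase phase;        /* where the component is in its own cycle */
    int program_counter;
} Cpu;

typedef struct {
    int counter;           /* ticks left until the timer fires */
    int reload;            /* value reloaded after firing */
    int times_fired;
} Timer;

/* Advance the CPU by one tick. The position inside the instruction cycle has
   to be stored explicitly in `phase`, because no stack frame survives
   between calls. */
void cpu_step(Cpu *cpu) {
    switch (cpu->phase) {
    case PHASE_FETCH:
        cpu->phase = PHASE_EXECUTE;
        break;
    case PHASE_EXECUTE:
        cpu->program_counter++;
        cpu->phase = PHASE_FETCH;
        break;
    }
}

/* Advance the timer by one tick. */
void timer_step(Timer *timer) {
    if (--timer->counter == 0) {
        timer->times_fired++;
        timer->counter = timer->reload;
    }
}

/* "Context switching" is just calling each component's step function. */
void run_ticks(Cpu *cpu, Timer *timer, int ticks) {
    for (int t = 0; t < ticks; t++) {
        cpu_step(cpu);
        timer_step(timer);
    }
}
```

The switch between components costs only a function call, but the `phase` field shows the trade-off: whatever a coroutine would keep implicitly on its stack must be written out as explicit state.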

Moonchild wrote: ↑2022-01-19, 19:16:26 Systems such as Erlang rely on extremely efficient context switching between 'threads', as they encourage a parallel programming style that leads to large numbers of these threads. I do not believe hardware context switches can be faster than software ones, even if they are more efficient than on traditional architectures.

I agree that hardware context switching, i.e. thread switching, is inefficient. Synchronization and communication between threads, in particular, is inefficient. Software switching will be more efficient if you have just one CPU core. This can be implemented as a message loop. If you want preemptive multitasking, you may need a separate stack for each process, but this should be no problem in ForwardCom as long as the size of each stack is limited.
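A message loop of this kind might look roughly like the following C sketch. The queue layout, the two ping-pong tasks, and all names are assumptions made for the example; it shows cooperative switching on a single core only, not preemption.

```c
#define QUEUE_CAPACITY 16

typedef struct {
    int task_id;           /* which task should receive the message */
    int payload;
} Message;

/* Fixed-size ring buffer of pending messages. */
typedef struct {
    Message items[QUEUE_CAPACITY];
    int head;
    int count;
} MessageQueue;

/* Append a message; returns 0 if the queue is full. */
static int queue_post(MessageQueue *q, int task_id, int payload) {
    if (q->count == QUEUE_CAPACITY)
        return 0;
    int tail = (q->head + q->count) % QUEUE_CAPACITY;
    q->items[tail].task_id = task_id;
    q->items[tail].payload = payload;
    q->count++;
    return 1;
}

/* Remove the oldest message; returns 0 if the queue is empty. */
static int queue_take(MessageQueue *q, Message *out) {
    if (q->count == 0)
        return 0;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/* Two tasks (ids 0 and 1) pass a decreasing counter back and forth. */
static void handle_message(MessageQueue *q, const Message *m, int handled[2]) {
    handled[m->task_id]++;
    if (m->payload > 0)
        queue_post(q, 1 - m->task_id, m->payload - 1);
}

/* The message loop: each "task switch" is one dequeue plus one call.
   Returns the total number of messages processed. */
int run_message_loop(int initial_payload, int handled[2]) {
    MessageQueue q = { .head = 0, .count = 0 };
    Message m;
    int total = 0;
    handled[0] = handled[1] = 0;
    queue_post(&q, 0, initial_payload);
    while (queue_take(&q, &m)) {
        handle_message(&q, &m, handled);
        total++;
    }
    return total;
}
```

Starting with a payload of 5, the loop processes six messages, three per task, without any thread or stack switch.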

Post by **Moonchild** » 2022-01-20, 8:18:07

agner wrote: ↑2022-01-20, 7:37:28 A non-local return will cause not only this return, but all pending returns to be mispredicted on CPUs that have a return prediction mechanism, which all high-end processors have today.

I was going to mention this, but I forgot. Isn't this a case where ForwardCom would do much better than traditional CPUs? Because the call stack is expressed explicitly, it is the same as the return prediction stack. So after you install a new call stack, you may get a pipeline stall, but then you start predicting returns correctly from it.

agner wrote: ↑2022-01-20, 7:37:28 you need a branch to check if the return value is valid. A good branch predictor may be able to predict this.

Branch prediction is fine, but I worry about locality. Optimizing compilers can move error-handling code far away from the hot path, so it does not pollute cache. This adds appreciable space overhead to every call.

(Though, on a related note, what do you think about branch hints? Mainly thinking about avoiding BTB bloat rather than mispredictions.)

emulators
agner wrote: ↑2022-01-20, 7:37:28 Why not make each component an object in an object oriented language? The whole context of a component is stored in the object. There is no cost of context switching - you just call a member function (method) of each object.

The article I linked discusses this. The context required is complex, and it is very cumbersome to represent it explicitly rather than implicitly on the call stack.

Post by **agner** » 2022-01-20, 9:12:45

Moonchild wrote: ↑2022-01-20, 8:18:07 Isn't this a case where ForwardCom would do much better than traditional CPUs? Because the call stack is expressed explicitly, it is the same as the return prediction stack. So after you install a new call stack, you may get a pipeline stall, but then you start predicting returns correctly from it.

Yes, exactly.

Instructions for manipulating the call stack do not have to be privileged, I just thought it would be safer. If there is a significant need, then it would be possible to make the privileged status of these instructions changeable. I just don't see such a need. Error handling is a rare event that is not time-critical.

Moonchild wrote: ↑2022-01-20, 8:18:07 what do you think about branch hints?

Intel had branch hint prefixes on the NetBurst architecture, which was not very successful. They haven't used these prefixes since then. A branch hint only affects the prediction the first time a branch is met, which is irrelevant for overall performance. You still need a BTB entry if the hint says branch 'taken'. It is more efficient to organize code so that a forward branch is not taken most of the time.
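At the source level, this kind of code organization is usually requested through compiler hints rather than ISA branch hints. The sketch below uses the GCC/Clang extensions `__builtin_expect` and `__attribute__((cold))`, guarded so that it still compiles elsewhere; the macros `UNLIKELY` and `COLD` and the functions are invented for the example. Whether the compiler actually emits the rare path as a not-taken forward branch is its own decision, so treat this as an illustration of intent only.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)     /* condition is rarely true */
#define COLD __attribute__((cold, noinline))       /* keep off the hot path */
#else
#define UNLIKELY(x) (x)
#define COLD
#endif

static int bad_samples = 0;

/* Rare path: a compiler may place this far away from the hot loop. */
static COLD void handle_bad_sample(size_t index) {
    (void)index;
    bad_samples++;
}

/* Hot path: sum the non-negative samples. The error check is written so that
   the common case falls straight through. */
long sum_valid(const int *samples, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(samples[i] < 0)) {
            handle_bad_sample(i);
            continue;
        }
        total += samples[i];
    }
    return total;
}

/* Number of rejected samples so far. */
int bad_samples_seen(void) {
    return bad_samples;
}
```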

ForwardCom branch instructions have two vacant bits that can be used for hints in the double-size branch instructions, but not in single-size branch instructions. So it is possible to implement branch hints. I just don't think it is worthwhile. I haven't seen any successful use of branch hints that actually matter in terms of performance. There are various patents, but I don't think they are useful.

Post by **HubertLamontagne** » 2022-01-24, 19:18:11

I imagine these uses could be handled by an OS call without requiring a full interrupt. That gives an intermediate performance level (a privilege level change, but no full pipeline flush and context switch as with an interrupt). I guess that still makes sense, since longjmp and triggering exceptions don't happen too often (though try-catch blocks are more common, so perhaps grabbing the stack pointer needs to be faster). Something like:

#include <stdint.h>

/* Proposed OS calls for reading and modifying the call (IP) stack. */
uint64_t getIpStackPointer(void);
void setIpStackPointer(uint64_t newOffset);
void copyFromIpStackMemory(uint64_t offsetInStack, uint64_t nbQWords, uint64_t *dataOut);
void copyToIpStackMemory(uint64_t offsetInStack, uint64_t nbQWords, const uint64_t *dataIn);
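To show how a try/throw pair might use such calls, here is a self-contained C mock. Everything beyond the four function names is an assumption: the call stack is simulated by an ordinary array, the stack pointer is taken to be an entry count that grows upward, and `mock_call` and `demo_try_throw` are invented helpers. A real implementation would live in the OS and act on the protected hardware stack.

```c
#include <stdint.h>
#include <string.h>

/* Mock call stack standing in for the protected hardware stack (assumption). */
#define MOCK_STACK_QWORDS 256
static uint64_t mock_ip_stack[MOCK_STACK_QWORDS];
static uint64_t mock_ip_sp = 0;       /* assumed: number of entries in use */

uint64_t getIpStackPointer(void) { return mock_ip_sp; }

void setIpStackPointer(uint64_t newOffset) { mock_ip_sp = newOffset; }

void copyFromIpStackMemory(uint64_t offsetInStack, uint64_t nbQWords,
                           uint64_t *dataOut) {
    memcpy(dataOut, &mock_ip_stack[offsetInStack], nbQWords * sizeof(uint64_t));
}

void copyToIpStackMemory(uint64_t offsetInStack, uint64_t nbQWords,
                         const uint64_t *dataIn) {
    memcpy(&mock_ip_stack[offsetInStack], dataIn, nbQWords * sizeof(uint64_t));
}

/* Stand-in for a hardware call instruction pushing a return address. */
static void mock_call(uint64_t return_address) {
    mock_ip_stack[mock_ip_sp++] = return_address;
}

/* Simulated try/throw: remember the stack pointer on entering the try block,
   make three nested calls, then "throw" by restoring the saved pointer.
   Returns the stack pointer after the throw and stores the return address
   that is now on top of the stack in *top_return_address. */
uint64_t demo_try_throw(uint64_t *top_return_address) {
    mock_ip_sp = 0;
    mock_call(0x1000);                         /* caller of the try block */
    uint64_t try_mark = getIpStackPointer();   /* entering try */
    mock_call(0x2000);
    mock_call(0x3000);
    mock_call(0x4000);
    setIpStackPointer(try_mark);               /* throw: drop three frames */
    copyFromIpStackMemory(try_mark - 1, 1, top_return_address);
    return getIpStackPointer();
}
```

Under these assumptions, entering a try block costs one `getIpStackPointer` call and a throw costs one `setIpStackPointer` call, while the copy functions would cover heavier uses such as saving and restoring a coroutine's frames.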

Post by **Moonchild** » 2022-01-31, 1:02:01

agner wrote: ↑2022-01-20, 9:12:45 A branch hint only affects the prediction the first time a branch is met, which is irrelevant for overall performance. You still need a BTB entry if the hint says branch 'taken'. It is more efficient to organize code so that a forward branch is not taken most of the time.

Interesting, thanks--does that mean there is no BTB cost for forward never-taken branches?

Post by **agner** » 2022-01-31, 7:16:09

Moonchild wrote: ↑2022-01-31, 1:02:01 does that mean there is no BTB cost for forward never-taken branches?

On some processors, yes.
