The original proposal, 2 years ago

Post by **agner** » 2017-11-02, 12:32:59

Author: Agner Date: 2016-08-01 01:48
I have made an introduction to the ForwardCom project at http://www.forwardcom.info.

I have added a few optional instructions to facilitate matrix multiplication.

Author: EricTL Date: 2017-07-17 18:15
Great proposal so far with a lot of interesting ideas.
May I suggest a couple of instructions:
-Bit reverse: Reverse all the bits of a register.
This is very useful to enable bit hack for the most significant bits.

-LUT base logical ops: Rd, Rs Imm: Rd OPS Rs -> Rd Where OPS is a 8 bits LUTs implementing any three input one output logical expression.
4 more bits of the immediate field will specify msb, lsb and the following 3 cases:
OPS(Rd, msb:Rd >> 1, Rs) -> Rd
OPS(Rd, Rd:lsb << 1, Rs) -> Rd
OPS(Rd:lsb << 1, msb:Rd >> 1, Rs) -> Rd
That instruction could replace all the other logical instructions. And, Or, and-not, ... and adds some new one.

-MultiCarryOps:
Rd ops Rs under Rc (Carry mask): Ops can be +, - >>, <<, rot The carry mask work as follow
Each bit set in the Rc indicate the msb of a sub bit string. i.e. the carry does not propagate to the left bit of a set bit in the carry mask.
The idea is to enable efficient bitfield ops.
A compressed version could use force a bit field to be at least two bits and offer Rd ops Rs where Rs 'or' together the carry mask and a single bit after each Cm bit set.
e.g. Cm = 0x0420: 0x7C1F + 0x0631 => 0x8020; 0x7C1F - 0x0631 => 0x77FE; 0x7C00 - 0x0230 => 0x7FE0
These instructions will enable a better use of bit-fields and enable more complex loop count in one instructions. Also if made atomic they could enable new lock free algorithms.

Author: Agner Date: 2017-07-18 00:28
EricTL wrote:

May I suggest a couple of instructions:
-Bit reverse: Reverse all the bits of a register.

This instruction exists. It is called bit_reverse.

LUT base logical ops:

Instruction truth_tab2 is a two-input LUT.
Instruction truth_tab3 is a three-input LUT.

These instructions implement arbitrary functions for boolean vectors. However, the function is applied only to the least significant bit of each vector element. There are two reasons for this limitation. (1) This is the way boolean vectors are represented in ForwardCom. The boolean value is stored in the least significant bit. The remaining bits are option bits, which can be propagated to the result and used for specifying options to e.g. a floating point instruction using the LUT result as a mask. (2) The 3-input LUT can be implemented in hardware by using an 8-bit barrel shifter where the 8 bits of the LUT are shifted right by the count defined by the three input bits. The hardware allready has an 8-bit barrel shifter for each 8-bit vector element so that the implementation is cheap. A LUT function that applies to all bits of a vector will require one 8-bit barrel shifter for each vector bit, which would be more expensive in terms of silicon space.

I am not sure I understand what you want to do with the msb, lsb inputs?

MultiCarryOps:
Rd ops Rs under Rc (Carry mask): Ops can be +, - >>, <<, rot The carry mask work as follow
Each bit set in the Rc indicate the msb of a sub bit string. i.e. the carry does not propagate to the left bit of a set bit in the carry mask.

Interesting idea. But this has no high level language equivalent. It will only be used by programmers who master the use of intrinsic functions or assembly language. A simpler solution is to just leave one unused bit between each bitfield to avoid carry from one bitfield to the next.

BTW, I have made the move_bits instruction for efficient insertion and extraction of bit fields.

Author: EricTL Date: 2017-07-20 19:19
I thought I replied already. Maybe I did not sent it.

Agner wrote:

Instruction truth_tab2 is a two-input LUT.
Instruction truth_tab3 is a three-input LUT.

These instructions implement arbitrary functions for boolean vectors. However, the function is applied only to the least significant bit of each vector element. There are two reasons for this limitation. (1) This is the way boolean vectors are represented in ForwardCom. The boolean value is stored in the least significant bit. The remaining bits are option bits, which can be propagated to the result and used for specifying options to e.g. a floating point instruction using the LUT result as a mask. (2) The 3-input LUT can be implemented in hardware by using an 8-bit barrel shifter where the 8 bits of the LUT are shifted right by the count defined by the three input bits. The hardware allready has an 8-bit barrel shifter for each 8-bit vector element so that the implementation is cheap. A LUT function that applies to all bits of a vector will require one 8-bit barrel shifter for each vector bit, which would be more expensive in terms of silicon space.

I am not sure I understand what you want to do with the msb, lsb inputs?

Let's put aside the shit left or right by one (the immediate lsb and msb where there to feed the shift on both end of the register. )

A 4 bit LUT can be made with 3 multiplexers (mux.) per bit.

I contend the logical part of the ALU will need at least 3 mux. per bit to choose from A | B, A & B, A & ~B and A ^ B.
So in essence for logical encoded instruction the 2 mux. control bits comme from the instruction while for the LUT base the two control bits come from the registers.

In the simplest implementation the 4 logical ops will be done first then the mux4:1 will choose which one.

In the LUT case while the registers are being read the immediate 4 bits for the LUT are amplified and broadcasted to the mux4:1.

In modern CMOS on SOI a mux2:1 is just 2 transistors. So if one counts two inversors (2 Tr x 2 ) for every 8 bits => 4 x 4 x (64 / 8) = 128 Tr for the broadcast
plus 64 x 6 Tr for the mux4:1

Total 512 Tr for the universal solutions vs a minimum of 64 x (6+1) for the mux4:1 part plus the cost of the four logical ops...
Granted it is over simplified but I guess anyone can get the just of it.

Moreover with the LUT solution one get also Not, Clear, Id (move) plus any other less common logical expression
and in a two registers instruction settings ( Rd = Rd ops Rs ) the LUT enable (Rd := Rd ops Rs) and (Rd := Rs ops Rd).

The case for 8 bits LUT (universal three-input logical operation) is less clear. The mux is now 8:1 => 14 Tr and the broadcast doubled to 256 Tr => 1152 Tr.
I still think that instruction can be leveraged by a compiler to, at the very least, make the code smaller.

To finish on a rather hackish and somewhat silly note In the case of the tiny instruction format the 4 bits LUT can be implemented by using the code for Rs as both the register and the LUT. In that form Clear can be implemented as LUT R0 therefore reclaiming the opcode.

Not opcode can be reclaimed as well by LUT R5 (LUT convention b3,b2,b1,b0 -> 0101)
Freeing a opcode for something else. A clever compiler should be able to play with the limitation somewhat and get some further optimization.

Author: Agner Date: 2017-07-20 23:27
I don't like the idea of using the same bitfield as both instruction and register. You will not have a free choice of what register to use, and the out-of-order scheduler will have problems detecting whether it should wait for the value of that register or not.

Post by **agner** » 2017-11-02, 12:41:44

Author: Joe Duarte Date: 2016-08-04 01:23
Agner, there are a couple of specialized instruction domains that you seem to be pushing to FPGAs or other coprocessors, but I think ForwardCom will suffer from their exclusion.

Encryption may be specialized, but it's not optional anymore. Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage. You could say that FPGAs can do it, but computers today are expected to be able to do this without extra silicon. FPGAs are not in anyone's Bill of Materials for a smartphone, PC, or server (well, in rare cases for servers), and there's still a lot of doubt over whether there are enough programmers who know how to fully leverage FPGAs.

Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either. I know that programs rarely use those SSE 4.2 instructions today, but that sort of thing will become more common in the future as there is more pressure on energy use in cloud computing and battery life on mobile. At some point someone is going to build a clean-sheet web server that fits entirely in the L2/L3 cache.

Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?

Author: Agner Date: 2016-08-04 14:37
Joe Duarte wrote:

Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage.

You are probably right that AES instructions would be useful.

Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either.

Text strings are rarely so long that it really matters. It will take less time to parse a string than to read it from an SSD anyway.

x86 AES and SSE4.2 instructions are still only 128 bits while most other vector instructions are 256 - and soon 512 - bits. This is because they are costly to implement. In most cases, it will be faster to parse a string using simple compare instructions with the maximum vector length than a SSE4.2 instruction limited to 128 bits.

Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?

Instruction fetch and decoding is a bottleneck on all Intel processors. They are putting a micro-operation cache after the decoder to fix this, but each micro-op uses many bits so physical distances on the silicon chip is a limitation. The ForwardCom instruction format will be more efficient than this because it can decode fast and it is compact. x86 has a lot of suboptimal designs because it needs to preserve backwards compatibility with a long legacy of short-sighted designs and patches. They even keep supporting undocumented opcodes from the old 8086.

It is difficult to predict how much you can gain from having longer vectors. Some applications will be unable to use long vectors. The special vector loop will be an advantage whenever the loop count is not certain to be a multiple of the vector size.

Author: Hubert Lamontagne Date: 2016-08-05 12:25
Joe Duarte wrote:

Agner, there are a couple of specialized instruction domains that you seem to be pushing to FPGAs or other coprocessors, but I think ForwardCom will suffer from their exclusion.

Encryption may be specialized, but it's not optional anymore. Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage. You could say that FPGAs can do it, but computers today are expected to be able to do this without extra silicon. FPGAs are not in anyone's Bill of Materials for a smartphone, PC, or server (well, in rare cases for servers), and there's still a lot of doubt over whether there are enough programmers who know how to fully leverage FPGAs.

What about a specialized AES encryption device, accessed using memory mapped I/O like other peripherals like DMA controllers, timers etc? Wouldn't that do the job, and keep the overall complexity lower? The OS would probably have to make sure no 2 programs are using the AES device at the same time but otherwise it would be pretty straightforwards: set the key, push the data through. Afaik that's close to how ARM does it.

Joe Duarte wrote:

Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either. I know that programs rarely use those SSE 4.2 instructions today, but that sort of thing will become more common in the future as there is more pressure on energy use in cloud computing and battery life on mobile. At some point someone is going to build a clean-sheet web server that fits entirely in the L2/L3 cache.

Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature. Afaik, there are only 2 approaches for this at the moment:

- If your load is to run a whole bunch of programs at the same time (= it's a server), you can use the UltraSparc design: a really simple in-order CPU that uses tons of hyper-threading, and every time something stalls just switch to the next thread. The simple in-order core takes a lot less space so you can cram a whole bunch of them on the same chip, which lets you compensate for speed deficiencies.

- Otherwise, just use a general purpose out-of-order CPU that's as fast as possible. x86 fits this profile perfectly (which is why it's so popular still) and ARM is evolving towards that too. This is part of why RISC is popular: starting with a classic 4-issue out-of-order CPU like the Dec Alpha 21264, it's very hard to find new instructions to add that will actually speed it up at text processing and not be overly specific, and afaik the stuff that speed it up the most are just making branch prediction and data cache as fast as possible (look at x86 evolution for a perfect example of this). Truth is, if your text processing program has the kind of stuff that doesn't run fast on x86, chances are, it's doing stuff like defeating the branch predictor or data caching, and it's probably not going to run fast on ANY hardware. In fact, there are even good chances that it's not even filling the 4-instructions per cycle limit that modern Pentium-2 descendant cores can do.

Bytecode interpreters fall into a similar category of load-store-jump code and are hard to speed up for similar reasons. A way to put it is that while you can use parallelism to increase the number of calculations per second, you can't use it to increase the number of decisions per second, which tends to be limited by electrical constraints (clock rate, number of cycles of latency on the data cache, number of cycles of latency on issue - which can be reduced with a simpler instruction encoding but not past a certain limit).

I guess there's the special exception of the CPU designed to use dynamic translation - like the Transmeta Crusoe or NVidia's Denver (which in theory the Nintendo NX could use). The catch is that even then, these cores tend to perform good on loopy-mathy code, but not very well on loady-story-jumpy code (which is why the Denver performs better than ARM cores on benchmarks but worse in real world applications), and bytecode tends to be used for loady-story-jumpy code. It also doesn't help that high level programming languages tend to have all sorts of features that make the language inherently slow (one example that comes to mind is Python's data structures that I think use pointers to pointers just for getting to an object's data) and hard to dynamically compile (because they allow dynamic typing hacks, which makes it hard to protect the interpreter against corner cases).

Joe Duarte wrote:

Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?

That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).

Imho most of the ideas and innovation and potential speed gain in ForwardCom is in the vector instructions. This means that it has good potential for numeric-heavy applications (video games! physics!), if you use vectors, which generally means you have to program in C++ and use intrinsics (or use assembly). A few C++ compilers can auto-vectorize (Intel's compiler, MSVC 2012 and later and GCC with all speed options turned on) but it has to be "trivial loops" - if any of the memory accesses alias any other memory access in the loop, or there's any sort of carried state, then it's impossible to vectorize and the compiler will issue normal instructions (ie scalar integers and floats). Basically the compiler has to prove that each loop iteration is 100% independent from other iterations, which generally only happens in numeric-heavy code looping over a buffer and applying exactly the same mathematical operation on each iteration and data coming in neat linear arrays.

For higher level languages (anything dynamic typed like javascript or pyton etc), those are not designed for vectorization in any way so it would probably run 100% scalar. For middle level languages (Java, C#), those in theory are vectorizable but only in the same cases that C++ is vectorizable (trivial loops doing numerical computation), and generally people use C++ (video games) or Fortran (scientific computing) for those applications so there's not really any point AFAIK.

Author: Agner Date: 2016-08-06 02:44
Hubert Lamontagne wrote:

What about a specialized AES encryption device, accessed using memory mapped I/O like other peripherals like DMA controllers, timers etc? Wouldn't that do the job, and keep the overall complexity lower? The OS would probably have to make sure no 2 programs are using the AES device at the same time but otherwise it would be pretty straightforwards: set the key, push the data through. Afaik that's close to how ARM does it.

That's a pretty expensive solution. I wouldn't recommend a clone of the x86 AES instructions either, because they have long latencies which is bad for the smooth pipeline structure of ForwardCom. I would prefer to split up the AES operations into small unit operations with single-clock latency. If the length of vector registers is at least 2048 bits and we have full byte permute instructions then you can have a complete 256 byte lookup table contained in one vector register and do the table lookup with a byte permute instruction. The whole AES algorithm could then be implemented with general permute and XOR instructions without any special-purpose instructions. This would be a flexible solution that could be adapted to other encryption algorithms as well. The price is a large permutation circuit.

Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature.

The branch misprediction penalty is equal to the length of the pipeline. The best way to speed up unpredictable branches is to have a short pipeline or to speculatively execute both sides of a 2-way branch simultaneously. Some text processing jobs can be vectorized, using masked operations etc. UTF-8 conversion can be done with the compress_sparse and expand_sparse instructions in ForwardCom (sorry, they are poorly described in the manual).

Bytecode interpreters fall into a similar category of load-store-jump code and are hard to speed up for similar reasons. A way to put it is that while you can use parallelism to increase the number of calculations per second, you can't use it to increase the number of decisions per second, which tends to be limited by electrical constraints

ForwardCom has a multiway branch instruction which is useful for byte code interpreters, but I think the way to speed up byte code is JIT complation.

Joe Duarte wrote:

Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?

Hubert Lamontagne wrote:

That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core.

The target of ForwardCom is indeed to beat the performance/price ratio of existing systems. It is designed with a simple and consistent pipeline structure. It is doing one thing taking one clock cycle at each pipeline stage, except for multiplication, division, and a few other necessary things.

We should not forget, however, that in many applications the bottleneck is not the CPU but cache, memory access, disk access, or network speed.

Author: fanoI Date: 2016-08-08 06:00
Hello Agner,

what about an hardware accelerated Decimal Floating Point Unit as present in IBM Power CPUs? In particular for business applications the decimal type is necessary but now being done in software is 100x slower that float / double!

Author: Agner Date: 2016-08-08 10:26
fanoI wrote:

what about an hardware accelerated Decimal Floating Point Unit

The x86 instruction set has instructions for integer decimal arithmetics. However, these instructions are removed in the 64-bit mode because they are never used.

The normal procedure is to convert decimal to binary, do calculations in binary, and convert the result to decimal. These conversions can be vectorized to speed up the process. You can find code examples for this in my vector class library. Anyway, decimal numbers are for human use and the bottleneck will always be the human, not the computer

Author: fanoI Date: 2016-08-09 13:51
Agner wrote:

fanoI wrote:

what about an hardware accelerated Decimal Floating Point Unit
The x86 instruction set has instructions for integer decimal arithmetics. However, these instructions are removed in the 64-bit mode because they are never used.

The normal procedure is to convert decimal to binary, do calculations in binary, and convert the result to decimal. These conversions can be vectorized to speed up the process. You can find code examples for this in my vector class library. Anyway, decimal numbers are for human use and the bottleneck will always be the human, not the computer

I think the instructions you are referring were for the old BCD mode I was referring to the actual Decimal Floating Point:
https://en.wikipedia.org/wiki/Decimal_floating_point

they exist in C as GCC extensions (_Decimal32, _Decimal64 and _Decimal128), in C++ as decimal32, decimal64 and decimal128 and in C# there is the Decimal type. To do calculation with these types you need to resort to do them in software in x86 CPU (in Power CPU IBM has added a Decimal Floating Point Unit do this in hardware) they are used for financial calculations where powers of 10s are more important that power of 2s

Author: Joe Duarte Date: 2016-08-08 18:41
Hubert Lamontagne wrote:

Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature. Afaik, there are only 2 approaches for this at the moment:

- If your load is to run a whole bunch of programs at the same time (= it's a server), you can use the UltraSparc design: a really simple in-order CPU that uses tons of hyper-threading, and every time something stalls just switch to the next thread. The simple in-order core takes a lot less space so you can cram a whole bunch of them on the same chip, which lets you compensate for speed deficiencies.

- Otherwise, just use a general purpose out-of-order CPU that's as fast as possible. x86 fits this profile perfectly (which is why it's so popular still) and ARM is evolving towards that too. This is part of why RISC is popular: starting with a classic 4-issue out-of-order CPU like the Dec Alpha 21264, it's very hard to find new instructions to add that will actually speed it up at text processing and not be overly specific, and afaik the stuff that speed it up the most are just making branch prediction and data cache as fast as possible (look at x86 evolution for a perfect example of this).

I'm thinking of approaches Parabix: http://parabix.costar.sfu.ca/
One of their papers: https://www.cs.sfu.ca/~ashriram/publica ... arabix.pdf

I guess it just comes down to vector instructions and their parallel bitstreams approach. (Another way to boost it would be multiplexed streams, but that's a different plane of the architecture than the ISA.)

Hubert wrote:

That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).

Well, Broadwell-E is a high target right now, but the target is moving. It's going to be a low target in 2020. Of course Agner's not necessarily thinking in terms of marketable silicon, but more toward a useful reference and engineering exercise. I'm just wondering what kind of performance wins are possible. Another way of framing it: If a well-funded team rolled up their sleeves and built a new general purpose CPU architecture from scratch, including a new ISA and tape out and all that, using TSMC or Samsung's 16/14 nm process, could they beat Intel/Skylake? Maybe Intel is wringing all the performance out of current processes that the physics allows, and the legacy technical debt is only a design cost overhead. They seem to be treading water since Haswell, so I sometimes wonder if a better architecture would make a difference or if the wall is more fundamental than Intel.

By the way, do you think Itanium was just too early, with too little tooling? Does an out of order CISC have any inherent advantage over VLIW or EPIC? It seems goofy to send one instruction at a time. I haven't found any deep-dive post-mortems on Itanium, just vague claims that no one could build a good compiler for it, or that the physical product just wasn't that fast. I think we're going to see some big advances in compiler technology in the near future, with powerful platforms like IBM Watson doing things a laptop-bound compiler can't do.

By the way, regarding the idea of specialized AES encryption device, Intel offered QuickAssist on the v2 and v3 Xeons: http://www.intel.com/content/www/us/en/ ... rview.html

They don't seem to offer it on the Broadwells. It's a dedicated accelerator for encryption and compression, but I've never found a detailed review anywhere. Do either of you know more about it? I'm surprised it didn't get more attention.

Author: Hubert Lamontagne Date: 2016-08-09 17:49
Joe Duarte wrote:

I'm thinking of approaches Parabix: http://parabix.costar.sfu.ca/
One of their papers: https://www.cs.sfu.ca/~ashriram/publica ... arabix.pdf

I guess it just comes down to vector instructions and their parallel bitstreams approach. (Another way to boost it would be multiplexed

streams, but that's a different plane of the architecture than the ISA.)

Hmm, that's an interesting technique. Very reminiscent of how they used to do graphics on the Amiga and in EGA (which worked well for 2d but had to be abandoned once they started doing 3d).

Joe Duarte wrote:

Hubert wrote:

That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).
Well, Broadwell-E is a high target right now, but the target is moving. It's going to be a low target in 2020. Of course Agner's not necessarily thinking in terms of marketable silicon, but more toward a useful reference and engineering exercise. I'm just wondering what kind of performance wins are possible. Another way of framing it: If a well-funded team rolled up their sleeves and built a new general purpose CPU architecture from scratch, including a new ISA and tape out and all that, using TSMC or Samsung's 16/14 nm process, could they beat Intel/Skylake? Maybe Intel is wringing all the performance out of current processes that the physics allows, and the legacy technical debt is only a design cost overhead. They seem to be treading water since Haswell, so I sometimes wonder if a better architecture would make a difference or if the wall is more fundamental than Intel.

That's a good question. I think that's exactly what ARM is trying to do! In fact, there are a whole bunch of ARM licensees that are taking a stab at this, and imho it's proving to be a little bit harder than expected...

The catch is that once you have a cpu as big as a Broadwell, it has tons of stuff in it: in particular, a very complex memory architecture with a whole laundry list of features: aggressively pipelined multibank data cache with 2 read ports and 1 write port, TLB, out-of-order non-blocking everything with a memory ordering buffer, prefetching, multi-core cache coherency protocol, handling of misaligned addresses... This is basically almost totally independent of the architecture - you could implement all the same stuff on an ARM or MIPS or POWER core and it would be just as complex.

If you factor in the other stuff like aggressively pipelined secondary units like the FPU and SIMD unit, heavy branch prediction, hyper threading... this is all stuff that's basically the same on other architectures. The only thing that x86 changes is the front end. The front end part of a CPU isn't THAT large and complex compared to all the rest on modern cores with so many features.

And if, say, some new silicon process changes the distribution of bottlenecks so that faster instruction decoding becomes more important again, then Intel can simply put in a larger or broader micro-code cache, or maybe even go back to the instruction trace cache like on the Pentium 4. (ok, the P4 is a bit reviled, but the truth is that it still held on pretty good compared to the other well designed architectures of the time)

Joe Duarte wrote:

By the way, do you think Itanium was just too early, with too little tooling? Does an out of order CISC have any inherent advantage over VLIW or EPIC? It seems goofy to send one instruction at a time. I haven't found any deep-dive post-mortems on Itanium, just vague claims that no one could build a good compiler for it, or that the physical product just wasn't that fast. I think we're going to see some big advances in compiler technology in the near future, with powerful platforms like IBM Watson doing things a laptop-bound compiler can't do.

I think it boils down to a couple things... The first generation Itanium came out late. Because of this, RISCs and x86s of the day had an extra generation to catch up, which means that the first generation Itanium totally failed to impress. Second generation came out pretty fast and sorta fixed this, but Itanium could never quite reach up to the speed of the fastest RISCs and x86s. One guy said that it was never as fast as HP's previous architecture (PA-RISC, totally a classic RISC).

Second thing, Itanium was basically betting that out-of-order would fail to deliver enough speed gains to keep the RISCs and x86s relevant. It turns out that out-of-order delivers speed gains at just the right place: memory accesses. Real life programs have lots of memory accesses going on, which gives out-of-order a fairly hard-to-beat edge. It's just easier to deal with memory latency with an out-of-order pipeline than with the Itanium's insane ALAT thing with checked speculative loads and stores and whatnot. Late Itaniums can issue 12 instructions per cycle, which x86 will probably never be able to do, but even though x86 only issues 4 instructions per cycle, 3 of those can have a built-in memory access which can be crazy-reordered by the aggressive out-of-order pipeline, and it turns out that that's enough to smoke the Itanium.

There's a question on Quora about exactly this and it has tons of interesting answers:
Why did the Intel Itanium microprocessors fail? https://www.quora.com/Why-did-the-Intel ... ssors-fail

Author: Joe Duarte Date: 2016-08-11 19:09
Hubert Lamontagne wrote:

The catch is that once you have a cpu as big as a Broadwell, it has tons of stuff in it: in particular, a very complex memory architecture with a whole laundry list of features: aggressively pipelined multibank data cache with 2 read ports and 1 write port, TLB, out-of-order non-blocking everything with a memory ordering buffer, prefetching, multi-core cache coherency protocol, handling of misaligned addresses... This is basically almost totally independent of the architecture - you could implement all the same stuff on an ARM or MIPS or POWER core and it would be just as complex.

If you factor in the other stuff like aggressively pipelined secondary units like the FPU and SIMD unit, heavy branch prediction, hyper threading... this is all stuff that's basically the same on other architectures. The only thing that x86 changes is the front end. The front end part of a CPU isn't THAT large and complex compared to all the rest on modern cores with so many features.

And if, say, some new silicon process changes the distribution of bottlenecks so that faster instruction decoding becomes more important again, then Intel can simply put in a larger or broader micro-code cache, or maybe even go back to the instruction trace cache like on the Pentium 4. (ok, the P4 is a bit reviled, but the truth is that it still held on pretty good compared to the other well designed architectures of the time)

By the way, this point and your other point on dynamic translation, Transmeta, etc. remind me of the recent buzz around Soft Machines. I don't think I've seen any discussions of Soft Machines here. In particular, I was reminded of Ian Cutress saying in this article that Soft Machines was more than ten years behind Intel on speculation and branch prediction: http://www.anandtech.com/show/10025/exa ... e-visc-ipc

That really struck me as illustrating how much accrued work and research has gone into a modern Intel CPU, and that an ISA might be a relatively minor piece.

While I have you, what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions constantly added, and that they consistently speed things up.

Author: Agner Date: 2016-08-12 00:11
Joe Duarte wrote:

what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions constantly added, and that they consistently speed things up.

My idea is not to use microcode in ForwardCom at all (and no microcode cache, of course).
Intel and AMD use microcode only for complicated instructions that use many µops, as microcode is quite inefficient. There are thousands of different instruction codes in x86 and most of them are implemented directly in silicon. That must be a lot of silicon! I am sure that this is slowing things down. Some of these instructions were added mainly for marketing reasons, providing no real advantage. Many instructions are obsolete, but they keep supporting them for backward compatibility.

Author: Hubert Lamontagne Date: 2016-08-12 18:13
Joe Duarte wrote:

what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions constantly added, and that they consistently speed things up.

Afaik, generally, there are '4 levels of implementation' for chips:

- FPGA: You write your logic in Verilog or VHDL, compile it to the FPGA, then run it.

- FPGA to ASIC conversion: This is basically an ASIC with a bunch of FPGA-like cells, but where the metal interconnection layer is customized. This has much lower fixed costs than an ASIC, but there's only a speed gain of about 30% and a cost gain of about 50% over FPGAs.

- ASIC: The Verilog / VHDL code is compiled to a proper circuit layout. Most chips use this. This makes much faster chips and has a much lower unit cost, but the fixed costs are very high ($1million+, need lots of units). There needs to be lots of verification.

- ASIC with additional custom-layout parts: Same as above but some parts are custom-laid out rather than automatically. AFAIK Intel does this for the most speed-critical parts of its CPUs (like the L1 Data Cache) and it is part of its advantage over AMD (I've read somewhere that AMD has stopped doing custom layout).

More info: http://electronics.stackexchange.com/qu ... -asic-made

Within the chip, instructions that have the same memory access pattern and latency often are interpreted more or less the same, with the ALU doing the different possible arithmetic operations in parallel and then having a multiplexer at the end that selects which result it uses. In Verilog, it looks something like this:

case (alu_op)
0: alu_out <= b;
1: alu_out <= a + b;
2: alu_out <= a - b;
3: alu_out <= a & b;
4: alu_out <= a | b;
5: alu_out <= a ^ b;
6: alu_out <= a >> b;
7: alu_out <= a << b;
8: alu_out <= a >>> b;
9: alu_out <= a >= b ? 1 : 0;
10: alu_out <= ($signed(a) >= $signed(b)) ? 1 : 0;
default: alu_out <= alu_out; endcase

Agner wrote:

My idea is not to use microcode in ForwardCom at all (and no microcode cache, of course).
Intel and AMD use microcode only for complicated instructions that use many µops, as microcode is quite inefficient. There are thousands of different instruction codes in x86 and most of them are implemented directly in silicon. That must be a lot of silicon! I am sure that this is slowing things down. Some of these instructions were added mainly for marketing reasons, providing no real advantage. Many instructions are obsolete, but they keep supporting them for backward compatibility.

Yeah, no microcode is a nice goal to strive for. Though that requires a certain amount of "instruction set discipline" : no long latency or multi-step or multi-µop operations, or operations that load/store an excessive number of registers or do multiple memory accesses, or special-case exceptions, or instructions that can have all sorts of strange side effects deep in the pipeline and have to be specially executed in-order (like CPU mode manipulation using the flag registers for instance)...

Author: grant galitz Date: 2016-08-22 22:53
Pardon my slight skimming of the contents of this thread.

I was wondering if adding the ability for the amount of registers visible to be user programmable, like THUMB on ARM, was overviewed prior to forwardcom's writing. I could see special branch instructions upgrading or downgrading the number of registers to fine tune the instruction caching/fetching vs work done per instruction as a still useful tool. I'm also curious as to the implication this would have on context switching overhead, and any hardware logic issues.

Author: Agner Date: 2016-08-22 23:59
grant galitz wrote:

I was wondering if adding the ability for the amount of registers visible to be user programmable, like THUMB on ARM, was overviewed prior to forwardcom's writing.

The instruction code template has 5 bits for each register operand. This gives 25 = 32 registers.

Adding more registers would make the instructions bigger. Fewer registers would be wasteful. If the number of registers is variable then all the software, including function libraries and system code has to be recompiled every time the number of registers is changed to optimally use the number of available registers. I think it would be too complicated to reconfigure register space as cache or something else. This would require a lot of extra wiring between otherwise unrelated parts of the chip.

Author: Hubert Lamontagne Date: 2016-08-24 00:33
grant galitz wrote:

Pardon my slight skimming of the contents of this thread.

I was wondering if adding the ability for the amount of registers visible to be user programmable, like THUMB on ARM, was overviewed prior to forwardcom's writing. I could see special branch instructions upgrading or downgrading the number of registers to fine tune the instruction caching/fetching vs work done per instruction as a still useful tool. I'm also curious as to the implication this would have on context switching overhead, and any hardware logic issues.

One reason ARM has a 16bit instruction mode (THUMB) is that not all ARM cores have an instruction cache. If you're loading instructions from RAM or slow ROM (sometimes with a 16bit bus), you really really want instructions to be as small as possible because it's so costly to load them up. THUMB mode was often used on the GBA when running code from the slower RAM or ROM, which had a 16bit bus. This is also why the SuperH has 16bit instructions as well.

Having an explicit switch between ARM and THUMB mode simplifies the pipeline because it removes the need to have a prefetch buffer to handle unaligned instructions. Later on, they added THUMB-2 mode which removed this limitation and would let the program mix up 16-bit and 32-bit instructions.

The catch with small instructions is that 16-bit THUMB code runs significantly slower than ARM on modern cores that have instruction caches and large, 64bit buses, since so many instructions need to be broken up. And also, THUMB-2 code doesn't run faster than ARM in most applications, in spite of taking less instruction cache. One guy comparing the linux kernel size and the Coremark benchmark has:

vmlinux-ARM32: text 8.4mb, data 463kb, bss 827kb, total 9.7mb
vmlinux-thumb2: text 6.7mb, data 463kb, bss 827kb, total 8.0mb

The best is -O3 -funroll-loops -marm -march=armv5te -mtune=cortex-a8
The best armv7-a is -O3 -funroll-loops -marm -march=armv7-a -mtune=cortex-a8 at 95.2 % of overall best
The best Thumb-2 is -O3 -funroll-loops -mthumb -march=armv7-a -mtune=cortex-a8 at 88.7% of overall best
The best Thumb-1 is -O2 -mthumb -march=armv5te -mtune=cortex-a8 at 64.4% of overall best

(Source)

So being able to lower the number of registers below 32 to shrink opcodes on a CPU design that's targetting large machines with lots of ram doesn't seem too relevant to me...

Now, increasing the number of registers is another question and I'm curious about what people have to say about that.

Post by **agner** » 2017-11-02, 12:44:32

Author: Agner Date: 2016-08-23 00:24
Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions http://www.anandtech.com/show/10586/arm ... ata-center

The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:
Technology Update: The Scalable Vector Extension (SVE) for the ARMv8-A architecture https://community.arm.com/groups/proces ... chitecture

This new ARM extension has the same idea as ForwardCom that the same code should be able to run on different processors with different vector length. But the hardware seems much more complicated than ForwardCom. The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones. The compiler would also be quite complicated, I think. The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.

Author: Jorcy Neto Date: 2016-08-23 06:11
Agner wrote:

But the hardware seems much more complicated than ForwardCom. The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones. The compiler would also be quite complicated, I think. The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.

Whereas splitting wider vector/SIMD instructions into narrower execution units is quite straightforward (just like early SSE/SSE2 in PIII/P4), merging narrower vectors for wider units really seems way more complex.

Author: Hubert Lamontagne Date: 2016-08-26 11:21
Agner wrote:

Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions.

The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:
Technology Update: The Scalable Vector Extension (SVE) for the ARMv8-A architecture.

This new ARM extension has the same idea as ForwardCom that the same code should be able to run on different processors with different vector length. But the hardware seems much more complicated than ForwardCom.

Well, the ARM version is also somewhat more aggressive since it tries to deal with scatter-gather, carried loop dependencies (aka feedback) and data-dependent loop exits.

Agner wrote:

The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones.

Hmm, I'm re-reading the article, and now I'm really wondering what they actually mean about how it adapts to different vector sizes. Either:
- It's a software scheme (in which the software knows the vector size, and if you have, say, 512 bit vectors, then the top 1536 bits always read as 0) and the article is slightly wrong in the way it presents things.
- Or, vectors can always be 2048 bits and it simply uses more micro-ops on narrower processors (this is already how NEON works) - this relatively straightforwards to implement, although this leaves the issue of having to deal with oversized register files (I'm not up to date on how costly this is in hardware).
- Or maybe it's some other crazier scheme? I can't quite tell.

Agner wrote:

The compiler would also be quite complicated, I think.

Auto-vectorization is always complicated!

I'm not sure exactly why the ARM version is more complicated than ForwardCom though. Yes, the ARM version requires a more complex calculation to get vector length and turn it into lane flags, although there's a good chance this extra calculation can run in parallel with other stuff and essentially be free. The ARM version doesn't require transforming the loop index 0..n-1 into a negative index from end of buffer, which I'm sure is doable in SSA form but might get complex for loops with lots of references to the indexing variable.

Agner wrote:

The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.

It's true that the ARM version doesn't have the insight that if you use reverse indexing from the end of the buffer, you can clamp the index and use it as a count for the number of items per iteration. If I'm reading the news post correctly, I think the ARM idea is that they use the lane predication flags as vector sizes, so they probably need a couple of specialized instructions to produce the correct clamped item counts and masks. But they're doing both predication and variable vector size using the same hardware, which I guess is a good thing.

Maybe they're using general-purpose integer registers as lane masks - that would explain the 2048bit size limit (64bit register = 64 lane flags * 32bit float per line = 2048bits). The load/store unit needs to know the lane masks, so they need to be known early in the pipeline, and I don't think it's using too many general-purpose register file ports.

Author: Jorcy Neto Date: 2016-12-20 07:22
Agner wrote:

Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions.

The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:

Most of the HotChips yearly presentations are made publicly available at each late year. Now you can find further details on SVE at :
http://www.hotchips.org/wp-content/uplo ... 51-v11.pdf

Post by **agner** » 2017-11-02, 12:49:02

Author: Hubert Lamontagne Date: 2016-09-05 01:53
One recurrent question: How will ForwardCom run Linux's mmap function?

Author: Agner Date: 2016-09-05 05:12
You can map a virtual address space as unused, read only, write only, or read/write. In this way you can generate traps on first read and first write to a block of memory. If the mapped file is bigger than available RAM and you are writing in a random fashion then it will be more likely to flush a dirty part of the area to the file than to fragment the memory. The number of memory mapped files you can have is obviously limited by the size of the memory map.

I wouldn't recommend the use of memory mapped files with ForwardCom. Now that solid state disks are becoming popular (and bigger) there is hardly any point in using memory mapping for the purpose of minimizing disk head movements.

Author: Hubert Lamontagne Date: 2016-09-05 18:33
There's more to mmap:

- It's can be used to allocate memory, page by page (for instance as a back-end for malloc()).
- It can be used to memory map hardware devices. For instance, sometmes video device VRAM is mapped by mmap()ing /dev/mem at 0xa0000.
- It can be used for inter-procedure shared memory: if multiple programs mmap() the same file, they will actually get a mapping to the same block of RAM.
- The fork() function relies on paging to do a shallow copy of the whole memory space, and then selectively does a real deep copy of a page when one of the forks writes to it and triggers a page fault.
- Demand paging of newly allocated memory lets the OS deal with badly behaved memory allocators, for instance when the Java virtual machine allocates 512mb of ram on startup. Executable data of large shared libraries like QT can also be loaded on-demand this way.
- When RAM runs out, mmap()ed memory that is backed by disk files and doesn't need to be written to swap-space can be priorized.
- Also, mmap() also works well with SSDs.

Author: Agner Date: 2016-09-06 08:51
Hubert Lamontagne wrote:

There's more to mmap

I was not aware of these details.

Many of the applications of mmap you mention look like hacks in my opinion. A redesign of the memory management system should offer genuine solutions to these problems rather than hacks.

Malloc should be implemented in a way that matches the new memory management system.

Video ram should be accessed only through a device driver. This device driver may share video memory with an application only if the OS allows an application to monopolize the screen.

Sharing of memory between processes is described in the ForwardCom manual. It should be done in a standardized way for security reasons.

Demand paging and memory swapping to disk are separate problems that should have dedicated solutions without using mmap.

Author: Bigos Date: 2016-09-06 16:19
I don't think you will be able to remove all usages of mmap(). Memory mapping is very convenient and if a microarchitecture doesn't allow for it won't be used much for general purpose computing. Programmers are already familiar with it and created a lot of utilities that use it. For example, you couldn't port most of the general purpose OSes to such an architecture, since they are based on virtual memory (there is a special fork of linux called uclinux [1] that doesn't need an MMU, though it's specialized for embedded solutions).

For example, 3D graphics APIs have a memory buffer map functions that are used to map part of GPU memory into the process address space. Since this part is already exclusive to the application (it has been allocated previously), there is no harm in bypassing the device driver (which performs the map) and it lets the user perform reads and writes with higher performance. A solution with a shadow buffer that is copied around when needed is possible, but would reduce performance.

A mechanism used by the Mill CPU family uses a protection data that maps which memory is to be accessible by which application [2]. It's stored in a PLB (protection look-aside buffer) that, unlike TLB, is not on a critical path since it only has to prevent a failing writes from affecting other CPUs and make failing reads cause a fault - no address translation is performed. Maybe such a mechanism would be a good compromise here? The OS (or any other entity) would be able to let another one access its memory on demand.

Mill also has a memory translation table, but the TLB sits outside of the core, near a memory controller. The mapping is global, not per-process (no TLB flush on context switch needed) and does the translation only when an access goes to the memory, at which point the latency is already high so making the TLB fast is not really that important. All the data in the caches is virtually addressed, virtually tagged.

But I repeat myself here. We could come up with a better solution instead, though it doesn't appear that it will be simple.

Author: Hubert Lamontagne Date: 2016-09-06 22:41
Ok I checked, there are 2 OS calls to allocate RAM in Linux: sbrk() and mmap(). They serve as a back-end to malloc(). The "heap" segment shown on some Linux memory map diagrams is the section allocated by sbrk(), and the rest of user-allocated ram is allocated using mmap().

The current default implementation of malloc() is ptmalloc2 and it uses both sbrk() and mmap(), but in different scenarios:
- Small to medium allocations on the main thread use sbrk(), because it doesn't matter too much if some blocks get free()'d later on since the allocator can still reuse the leftover holes for allocations of the same size.
- Large allocations on the main thread use mmap(), because sbrk() can only reclaim free()'d memory in the exact reverse order that it was allocated, so that a large allocation can easily become impossible to reclaim if it's followed by small longer-lived allocations.
- Allocation on other threads use mmap() instead of sbrk(), because only mmap() is thread-safe. Also, using sbrk() would create a series of segments that are used by different threads on the heap, which would make it hard to reclaim memory until all the threads have finished running.

Author: Agner Date: 2016-09-07 00:10
Hubert Lamontagne wrote:

Ok I checked, there are 2 OS calls to allocate RAM in Linux: sbrk() and mmap().

Thanks for the clarification. sbrk would be easy to implement in ForwardCom. And it could be thread safe because each thread has its own memory map and its own private memory in ForwardCom. Everything that causes a heavy fragmentation of memory for the same thread is a no go. In some applications, the limitation can be circumvented by making multiple threads - one for each block of shared memory.

Author: Hubert Lamontagne Date: 2016-09-07 17:18
Agner wrote:

And it could be thread safe because each thread has its own memory map and its own private memory in ForwardCom.

Ah, but if your threads don't share memory, then they're not threads, they're separate processes, which is a different thing!

Author: Agner Date: 2016-09-08 00:39
Hubert Lamontagne wrote:

if your threads don't share memory, then they're not threads, they're separate processes, which is a different thing!

ForwardCom has a special security feature for threads. A process can have a parent thread and several child threads. All child threads have access to the memory of the parent thread. The stack, heap and thread-local storage of a child thread is private by default, not accessible to its parent and siblings. This can be useful if the process is servicing several clients. Communication between threads - e.g. a semafor - must use the parent's memory space.

Thread memory is shared only if this is necessary for compatibility with legacy software.

Author: Hubert Lamontagne Date: 2016-09-08 17:24
Agner wrote:

ForwardCom has a special security feature for threads. A process can have a parent thread and several child threads. All child threads have access to the memory of the parent thread. The stack, heap and thread-local storage of a child thread is private by default, not accessible to its parent and siblings. This can be useful if the process is servicing several clients. Communication between threads - e.g. a semafor - must use the parent's memory space.

Thread memory is shared only if this is necessary for compatibility with legacy software.

If you want to port anything that uses threads on a POSIX OS (which all use the pthreads library) or on Windows, the expected behavior is that all threads share the same memory yes. For instance, if multiple threads can modify a std::vector (protected by a mutex), then if the vector changes size and its memory is reallocated, all threads have to see the newly allocated memory regardless of which thread caused the size change. Since std::vector memory allocations ultimately use malloc(), this means that malloc() has to return memory which is visible to all threads by default.

Author: Commenter Date: 2016-09-07 19:28
I don't consider myself well informed on this topic, so please excuse me for any mistakes.

there is hardly any point in using memory mapping for the purpose of minimizing disk head movements

I've actually never heard of mmap being useful for that purpose? mmap is useful for reading files as it avoids an extra memory copy. When a file is mapped, the userland application essentially can read directly from the kernel's disk cache buffer. On the other hand, with standard read() calls, the kernel has to make an additional copy into the userland supplied buffer. This works similarly for write() calls.
As I/O gets faster, reducing this kind of overhead becomes more important.

In fact, I'd imagine that with faster NVRAM style devices, mmap may allow the file to be read directly off the I/O device without ever being read into system RAM.

Author: Bigos Date: 2016-09-08 13:04
Commenter wrote:

In fact, I'd imagine that with faster NVRAM style devices, mmap may allow the file to be read directly off the I/O device without ever being read into system RAM.

Actually, this is what Linux DAX infrastructure does. It allows you to map nvram directly to the process's address space.

Post by **agner** » 2017-11-02, 12:56:46

Author: Kurt Baumgardner Date: 2016-09-09 23:40
I have received this mail from Kurt Baumgardner and I think it belongs on the forum so that others can join the discussion:

Hi Agner,

My name is Kurt Baumgardner, and I've been reading your processor design draft. Very impressive! Quite obviously a labor of love. You've done your research, which is also obvious from the rest of your site. I highly respect your work and fndings.

Please understand, I don't mean to be critical. But I had to mention my worry about the decision to drop paging from the memory model, which I think is going to be missed. In a handful of 'controversial' choices, you provided a logical rationale for going with one method vs. another, but, either I missed it, or you didn't go into detail about why no paging. I think it is related to your application memory map model, which, if I read correctly, is a simpler (and possibly tighter) app-based, "just what's needed" model. Such a model runs counter to a typical paging scheme, where most mallocs receive pages vs. arbitrary-sized blocks. I will be the first to admit that I must read that section a few more times And, even then, it's a deep subject to follow.

Regardless, I feel that, from what I understand so far, the memory model might be a bit too simple. You described an idea of the compiler calculating the needed stack size, and/or profiling to attempt to determine the needs. And, this is made easier with the 2 stack approach (which I really like). I understand this pre-calculation approach for optimization, but is its purpose also to help reduce complexities that would make your memory model seem less atractive? (I'm not sure if I described that understandably).

Let me take a step back. My biggest issue/fear/disappointment is with the removal of paging. It always nice to fit everything in memory (64Mb is enough for anyone, right? , but you can't always do it. With all the ideas of future compatibility, like with software automatically taking advantage of larger vectors, the loss of paging seems like a huge step backwards, as it seriously limits machines without large amounts of installed memory. I know of a small percentage of server apps that allocate memory with disregard for actual physical memory size, depending on and expecting the paging hardware and software to handle the otherwise messy details of managing extremely large datasets.

In my opinion, 2 of the most important inventions of computing are: #1: using free memory as a disk cache, and #2: built-in automatic paging. There's really no other practical way to provide apps the freedom to allocate memory as needed. Apps have no business trying to home bake their own paging scheme.

Without automatic paging, all the work you did for future compatibility is wasted. Why? Because developers will not write the big programs that could take advantage of future improvements, because they're afraid of running out of memory!

I know the goal is to simplify the system, for elegance, for cost, for power efficiency, for performance, and many other good reasons. But, your future chip must at least be able to support the hugely-successful and useful capabilities of today's processors. The loss of paging cripples the system severely as a whole.

Or, maybe I'm missing something huge. With the current spec, would you simply have to crash a launching (or running) program with an "Out-of-memory" error? Memory is handled so well currently that many programmers don't even test for the possibility of memory allocation to fail (which is awful, but also amazing that they can get away with it).

Thanks for your time reading, and, if you have a minute or two, please let me know what you think, or your rationales, or your thoughts. If you wish, this text could be placed into your forum. Have a nice day!

By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).
2a. Microcode programming (if so equipped), or
2b. the ability to control load/store, ALU, bit shift, cache control, vectoring, etc hardware.
3. An on-chip cache dedicated to user instructions. Could be used for extremely fast lookups/translations, pre-calculated math, etc.

(one can dream

Kurt

Author: Agner Date: 2016-09-10 00:18
Kurt Baumgardner wrote:

my worry about the decision to drop paging from the memory model,

The current systems have a very complicated memory paging systems with huge two-level page tables which are partially cached in the on-chip TLB. This system is complicated and using a lot of resources.

My goal with ForwardCom is to see what can be gained by a complete redesign. The paging system seemed an obvious candidate for simplification so I got the idea of replacing the huge number of fixed-size pages with a small number of variable-size pages. It was just an idea that I wanted to discuss, and as you can see it has already generated a lot of discussion.

This memory map system will only work if we can avoid heavy memory fragmentation, but if it works it will be a great advantage. The memory map will be so small that it can be contained entirely on the chip. You can have a separate memory map for each thread and you can easily switch memory map on system calls and task switches.

An application can still allocate memory dynamically, but preferably in a way that doesn't cause too much fragmentation. If an application allocates a huge amount of memory, not knowing how much it will actually need, then the OS may just allocate virtual memory in the first place. When physical memory is needed, it will be allocated in blocks of exponentially growing sizes (each new block will be bigger than the previous one) in order to minimize memory fragmentation. As a last resort, the system may move data around in memory if a contiguous virtual memory space has become mapped to a too fragmented physical memory. Moving data in memory may be fast because the data path to cache (and to main memory?) can be as wide as the maximum vector length.

The advantage of having variable size memory map entries is that you will need much fewer entries. The disadvantage is that you need a lot of comparators to check all entries in one or a few clock cycles. The actual number of entries we will have depends on the hardware implementation and this is subject to experimentation. A well behaved application that causes little or no memory fragmentation will need very few memory map entries. So a programmer who makes a performance-critical application should exercise a little discipline to avoid excessive memory fragmentation.

Author: Hubert Lamontagne Date: 2016-09-11 13:40
I guess the system is somewhat close to software TLB schemes like on the MIPS. The difference is, of course, that on software TLB schemes, pages must have 2^N sizes and be aligned. The advantage is that you don't need a full comparator per TLB entry and a Content-addressable memory is sufficient, which means a larger TLB is possible - MIPS classically had a 64 entry TLB.

Variable page size is not that well established as a technique: large pages don't work that well with copy-on-write techniques, they don't deal well with fragmentation of memory, larger pages tend to be split up by operations like mmap / munmap / mprotect... Generally, memory contiguity is a limited ressource, with contention not only at program-scale but also at global OS-scale, and the general response seems to have been to keep using 4k pages and increase the size of TLB caches.

When you're adding multiple page sizes, you're introducing some hard NP-complete problems to the OS's page allocator, since it can't just have a list of available free pages that it distributes to programs willy-nilly, and you have to have all sorts of page-contiguity preserving algorithms. And I'm pretty sure that it's often not practical to allocate memory by enlarging already allocated pages because this only works if the contiguous memory is unallocated, which becomes not only a program-scale problem but an OS-scale problem.

Moving data in memory is NOT a good idea IMHO, no matterr how fast you can copy. Especially for applications like games, which can use hundreds of mbs of RAM and CANNOT have 5ms and 10ms stalls.

Author: Kurt Baumgardner Date: 2016-09-13 00:12
I don't think it's possible to both be thrifty with memory, and prevent memory fragmentation, from an application programmer's point of view. You can just allocate a huge heap at the start of your program, and do all your work in that heap, thereby preventing fragmentation. Or you can just use what's currently needed, releasing memory back to the OS, but creating the potential for fragmentation. As a programmer, personally I don't want to have to think about how the OS gives me my memory, and I want the OS to make use of memory I've released. And, in today's OSes, the method for providing memory to programs is a magic black box - you ask, and the OS gives. You really don't have a lot of control of how that occurs.

It would be nice to be able to provide hints: "I need a temporary chunk of memory that I'll give back soon", or, "I'll be needing a contiguous chunk of X size for a while."

I assume we're talking about a replacement for X86/X64, for servers and desktops. I can symphathize with the goal of optimizing, and wanting to keep everything simple, and that's a noble cause. But, before making it optimal, it has to work first (and by that, I mean it has to be practical, and be able to handle today's apps.)

Consider a web browser. Navigate to a text page (or a forum), and that browser may need a couple of Mb of memory. But, then navigate to an image-laden page, and BAM! You now need dozens of Mb to render all those images into. Some browsers delay the rendering of JPGs, for example, until you actually scroll down to where the JPG is visible. And if you keep scrolling, they toss that rendering from memory.

My point is that, some programs have very complicated memory allocation needs, and that does not suggest that they are written poorly. Those web browser designers are also optimizing, but they are concerned with memory usage, load time, render speed, and much more. Forcing them to allocate memory a different way limits their abilities. And, without paging, they are forced to be thrifty with memory. Which causes fragmentation. Full circle.

If this chip is to sit in people's homes across the world, you've got to expect that many thousands of different programs will be running on them. And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe. And, even trying to be respectful of the machine's resources seems difficult, as I am torn between releasing no-longer-needed memory (which is supposed to be the right thing to do), and trying to prevent fragmentation (which is, essentially, the opposite, unless I've missed something).

Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, ,if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design. But, maybe one has merit, so I present them for thought. I find the whole project fascinating, and I hope one day that you can get chip to the fabrication stage! Best of luck!

Author: Agner Date: 2016-09-13 00:39
Kurt Baumgardner wrote:

And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe.

Some performance critical programs are written by highly skilled programmers who want to tweak as much performance out of the system as possible. Depending on the application, they may be able to predict how much memory they will need and allocate it all at once. Such applications would only need a handful of memory blocks which could easily be handled by an on-chip memory map. Other programs, probably less critical, are made with point-and-click tools and abstract frameworks by people who have no idea what is going on behind the scenes. They may cause heavy memory fragmentation without even knowing it. Maybe we can have a dual system with an on-chip memory map for well-behaved applications and a partially software-based paging system which will be activated only if the memory becomes too fragmented. Programmers who make well-behaved applications will be awarded with superior performance.

Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, ,if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design.

These ideas are not farfetched, and some have already been implemented in various systems. Swapping register banks was even supported in some processors back in the 1980s. Putting RAM on the chip is an obvious thing to do for the less memory-hungry applications. The more RAM you put on the chip the slower it will be, so you need one or more levels of cache in between anyway.

Author: Kurt Baumgardner Date: 2016-09-13 19:08
Agner wrote:

Kurt Baumgardner wrote:

And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe.
Some performance critical programs are written by highly skilled programmers who want to tweak as much performance out of the system as possible. Depending on the application, they may be able to predict how much memory they will need and allocate it all at once. Such applications would only need a handful of memory blocks which could easily be handled by an on-chip memory map. Other programs, probably less critical, are made with point-and-click tools and abstract frameworks by people who have no idea what is going on behind the scenes. They may cause heavy memory fragmentation without even knowing it. Maybe we can have a dual system with an on-chip memory map for well-behaved applications and a partially software-based paging system which will be activated only if the memory becomes too fragmented. Programmers who make well-behaved applications will be awarded with superior performance.

Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting. I appreciate a system that remembers to give the programmer the keys to the kingdom, so to speak. Cause most systems provide a dumbed-down, common-denominator approach. But a dual system? That makes the system that much more complicated, doesn't it.? Maybe OS drivers and OS-protected low-level stuff could use the on-chip stuff exclusively. Otherwise, I'll be pushing to have my Super Zombie Crusher 3 pushing OS stuff out to let my 100-fps renderer live on-chip, right?!! What programmer would choose the slow model? I don't know - the more I think about it, the more I lean towards the traditional model. Yes, it's slow and complex. But, that's because it's necessary, to get the power you want and need. I think Intel got it right, somewhat...it's hard to deny. Simplifying it upfront leads to more complex schemes down the road. To me it seems inevitable. Now, please reconsider on-chip mass memory, which could alleviate this issue and others - see below.

Backto the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, ,if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design.

These ideas are not farfetched, and some have already been implemented in various systems. Swapping register banks was even supported in some processors back in the 1980s. Putting RAM on the chip is an obvious thing to do for the less memory-hungry applications. The more RAM you put on the chip the slower it will be, so you need one or more levels of cache in between anyway.

On the subject of massive on-chip memory: It's slow. Ok. But, how slow?. I really want to see the numbers on this one. Maybe it requires a new technology. But, let's assume it's possible. Think about it. Low fixed wait states. 0 cache misses. No complicated cache hardware - this is a big one. Or, that pipeline burst cache hardware instead force feeds the execution pipeline with instructions. Cache misses happen a lot. And, surely it's going to be faster than external memory. Is it expensive, money-wise? Is that why we don't see it being attempted Maybe you could spread the memory across multiple cores.

Some algorithms simply cannot avoid reading memory "vertically", such as 90-degree block rotation algortihms. These functions can suffer a cache invalidation on each read, unless proper cache hints are provided...yielding a needlessly slow read/write.

To me, cache misses and branch mis-prediction are the 2 things that prevent us from being able to optimize our code accurately, cause we cannot determine how much time each instruction will take, therefore we cannot pair instructions perfectly. Knowing exact memory timing allows the pipeline to be finely-tuned, I'd imagine. Removing the cache hardware provides fast chip real estate to put RAM, and simplifies the design enough to justify research into making it possible.

And, your memory map could sit in this same main memory too.

Author: Hubert Lamontagne Date: 2016-09-13 12:12
Kurt Baumgardner wrote:

Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, ,if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

http://www.zdnet.com/article/why-doesnt ... n-the-cpu/
http://electronics.stackexchange.com/qu ... e-cpu-chip

2. If not #1, then some really good cache hint/directive instructions.

Generally, this is done with instructions like PREFETCH on x86.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

A lot of branch instructions that your CPU pipeline will see are in loops and generally branch the same way a lot of times in a row, and would actually get slower if you ran both sides, because rarely executed paths would be competing for resources. Here's a random C++ example:

Code: Select all

for(int i=0; i<200; i++) {
float sample = buffer[i];
mFilterMem += (sample - mFilterMem) * mFilterParam;
sample = mFilterMem;
if(mVolumeFade) {
sample *= mVolume;
mVolumeFade += mFadeRate;
}
buffer[i] = sample;
}

The branch at the end of the loop (i<200) will be taken 199 times in a row, and only skipped once. Speculatively executing the code after the loop makes no sense in this case. In addition to this, the branch in the middle of the loop (if(mVolumeFade)) will either always be taken or never be taken in the 200 iterations, so in that case speculatively executing both sides of the loop doesn't make sense either, and it's better to trust your branch predictor.

Speculatively executing both sides is generally something you'll only see in very large and complex out-of-order CPU cores, and I think it probably involves something like a 3 or 4 way branch predictor that predicts NO-JUMP/MAYBE-JUMP/YES-JUMP or NO-JUMP/MAYBE-NO-JUMP/MAYBE-YES-JUMP/YES-JUMP. I don't think it handles multi-way branch prediction too well either (ie jump to function pointer... some modern cores can handle multiple jump targets like this).

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

Sounds a lot like a DMA controller - and if I'm not mistaken, the x86 DMA controller can be used for that, but I don't think it runs fast enough to get a speed gain. And it kinda competes with the CPU for memory access cycles, which isn't too good, although I guess you could make it operate in the unused cycles.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design. But, maybe one has merit, so I present them for thought. I find the whole project fascinating, and I hope one day that you can get chip to the fabrication stage! Best of luck!

One that's worth it and that is used on ARM is to have 2 stack pointer registers, one for USER mode and one for OS mode, and switch when interrupts happen, because this makes interrupt handlers faster. Fast interrupt handlers are generally nice to have because then it makes stuff like software TLBs realistic, and it speeds up OS calls. Fast task switch between user programs is a bit less important because you're going to have tons of cache misses anyways, so not having to save/restore registers isn't as important (and a lot of programs can run at the same time, which makes it unlikely that you can hold all register sets at the same time).

Author: Kurt Baumgardner Date: 2016-09-14 16:56
Thanks, guys - great, informative, interesting answers. I wish I knew more about the architecture. I would like to come up with a way to mitigate the costs of a pipeline flush. It seems like such a shame of a waste when it happens. As good as modern branch prediction is, a lot of that benefit of smart prediction is thrown away. Maybe, while taking both branches, another predictor would realize that this wrong branch was being repeatedly taken, and would instead store an image of the pipeline that could be switched to when the misbranch was actually taken, and the other context could be flushed in the background.

I think I would pull my hair out trying to design such a thing. @Agner - I think you're on the right track avoiding microcode, keeping the design simple, and seeing just how well it can be made to perform. Best of luck.

Author: Hubert Lamontagne Date: 2016-09-11 17:54
Kurt Baumgardner wrote:

By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).

That sounds a lot like a CPU+FPGA combo... That would be a nice thing to have, I'd say the challenge would be that it's hard to share an FPGA between multiple programs. The usual strategy is to save/restore the state on task switch, but that's hard with a FPGA, since there's a lot of setup data, and it's hard to save/restore all the registers and block RAMs. You could try to share parts of the FPGA but that sounds very hard to do... though I guess you could split it in N units with a bunch of gates each, and each application would take up some number of units. Another model would be how video cards are shared.

2a. Microcode programming (if so equipped), or

That's a bit hard if most of the operations on your CPU are already single-microcode, like on a RISC. For a machine with combined memory+arithmetic instructions like x86 or ForwardCom, I guess you could make the combinations of memory-op + arithmetic-op user-configurable, or offer configurable multi-step ops like multiply-accumulate. You could also separate the address calculation and the load/store operation, as long as you don't add new register file writes/reads on the intermediary steps - you could have hidden extra registers that are only visible by sub-operations of your instruction, which wouldn't have to be renamed/loaded/stored to the main register file.

2b. the ability to control load/store, ALU, bit shift, cache control, vectoring, etc hardware.

VLIW cpus do something a lot like that. You basically specify which operation each part of the CPU is doing. They haven't really caught on yet because they have trouble dealing with things like memory store instructions needing to have the target address known early (so that program order in loads/stores can be preserved without requiring the compiler to do ultra-aggressive alias analysis) but the data to be written has to be known late (so that the CPU has time to calculate it).

3. An on-chip cache dedicated to user instructions. Could be used for extremely fast lookups/translations, pre-calculated math, etc.

(one can dream

Kurt

Hmmm... that might not be a bad idea, though it would make the issue hardware of the CPU appreciably more complex (so it would make branch mispredictions more costly). It would be useful to implement CISC instructions on a RISC base unit for instance... basically it would make a good cache for implementing the 2a proposition.

Author: Kurt Baumgardner Date: 2016-09-13 15:51
Hubert Lamontagne wrote:

Kurt Baumgardner wrote:

By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).
That sounds a lot like a CPU+FPGA combo... That would be a nice thing to have, I'd say the challenge would be that it's hard to share an FPGA between multiple programs. The usual strategy is to save/restore the state on task switch, but that's hard with a FPGA, since there's a lot of setup data, and it's hard to save/restore all the registers and block RAMs. You could try to share parts of the FPGA but that sounds very hard to do... though I guess you could split it in N units with a bunch of gates each, and each application would take up some number of units. Another model would be how video cards are shared.

I did not consider multiple programs wanting to use the FPGA - that *is* an ugly issue. Maybe it could be treated like a device, and programs would have to make a reservation, and obtain an exclusive lock for, at least, a portion of the array. I'm not sure there's a good answer there. But, you're onto something: It could be similar to the extra processing offered by GPUs. The reason it works as well as it does for GPUs is that, typically, only one program is using the GPU at any given time (when rendering games, which is, of course, the most visible use).

Author: Agner Date: 2016-09-14 00:45
Kurt Baumgardner wrote:

Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting.[...] What programmer would choose the slow model?

The OS will choose for you. If your application is well behaved and avoids memory fragmentation then you will use the on-chip memory map. If you allocate and deallocate huge amounts of memory randomly so that the memory becomes fragmented then the software based memory paging will be activated when the on-chip memory map is filled up. If the user opens and closes a lot of memory hungry applications so that there is no contiguous space to open the next program in then the OS may issue a warning and give the user the option to close some of the applications or run a garbage collector that moves some memory blocks around. The programmers will learn that if they want top performance they should exercise some discipline in memory allocation. I believe that most applications will be able to run without memory fragmentation.

Hubert Lamontagne wrote:

A lot of branch instructions that your CPU pipeline will see are in loops and generally branch the same way a lot of times in a row, and would actually get slower if you ran both sides, because rarely executed paths would be competing for resources. Here's a random C++ example

The branch predictor will know if a branch is easily predictable. It could select the branches with the worst prediction rate for executing both sides speculatively. In your example, an optimizing compiler may recognize that the inner branch is loop invariant and move it outside the loop, giving you one loop in each branch. The branch predictor should be able to get perfect prediction for a loop with constant count.

Kurt Baumgardner wrote:

16Gb memory on the same chip as the CPU

Hubert Lamontagne wrote:

http://www.zdnet.com/article/why-doesnt ... n-the-cpu/
http://electronics.stackexchange.com/qu ... e-cpu-chip

This is an obvious thing to do. I believe that we will see a lot of on-chip RAM some day when technology allows it. It may be SRAM. Small system-on-a-chip microcontrollers already have both SRAM and Flash on the chip. ForwardCom may be implemented in this fashion for some dedicated applications.

Kurt Baumgardner wrote:

I *really* like the idea of programmer-built extensions to the instruction set

You may have an FPGA on the chip. An application should be able to monopolize this FPGA, or at least a part of it. I am trying to avoid microcode completely in ForwardCom.

Author: Jorcy Neto Date: 2016-09-18 13:55
Agner wrote:

Kurt Baumgardner wrote:

16Gb memory on the same chip as the CPU
Hubert Lamontagne wrote:

http://www.zdnet.com/article/why-doesnt ... n-the-cpu/
http://electronics.stackexchange.com/qu ... e-cpu-chip
This is an obvious thing to do. I believe that we will see a lot of on-chip RAM some day when technology allows it. It may be SRAM. Small system-on-a-chip microcontrollers already have both SRAM and Flash on the chip. ForwardCom may be implemented in this fashion for some dedicated applications.

Actually, they've (Intel) already been studying the performance gains and power issues regarding logic+DRAM stacking quite since mid-2000's (see: "Die Stacking (3D) Microarchitecture" here http://www.cs.cmu.edu/afs/cs/Web/People ... lack06.pdf )

Post by **agner** » 2017-11-02, 13:29:05

Author: csdt Date: 2016-09-23 12:55
Very interesting project.

What if there were a special named register (eg: v0) with a special meaning:
- v0 always contains zeroed bits
- using v0 as input or output doesn't create any dependency.
We could have the same for r0.

This would be used to quickly get a zero value without creating any false dependency. I really don't like the x86 xor hack.
It could be used with any instruction, and can provide an easy way to compute some stuff in one single instruction without having a new instruction for it:
- negation: sub.f v1, v0, v1
- bitwise not: and_not v1, v0, v1
- getting -1 (int): and_not v1, v0, v0
- getting -0.0: set_bit v1, v0, $31
- compare to zero: compare v1, v1, v0, $whatever_the_comparison_is

It may also be used to discard a result:
- prefetching: move v0, [ r1, length=16]

The main drawback I can see is a register name wasted. It would also make the register renaming a bit more complicated (I think), but not so much afaiu.

I would be pleased to know what do you think of this.

Author: Agner Date: 2016-09-24 00:25
Some instruction sets have a null register. But, as you say, it is a register name wasted.

ForwardCom uses the principle of a null register in three cases:

Specifying r0 or v0 as a mask means no mask and unconditional execution.
Specifying the stack pointer (r31) as an array index means no index.
Specifying the stack pointer (r31) for vector length means a scalar.

There is no need for a register with the fixed value zero because a constant can be embedded in the instruction, and this constant can be zero or any other value.

Tiny instructions (half size) can have a 4-bit signed constant. Single size instructions (32 bits) can have a constant of 8 or 16 bits. Larger instructions can have constants of 32 bits (support for 64 bits constants is optional). This makes it possible to set a register to zero or any other simple constant without increasing the size of the instruction. There are various methods for compressing larger constants, and it also works for floating point values.

Author: Hubert Lamontagne Date: 2016-09-24 09:39
The nice thing about a null register is that it lets you remove some opcodes. MIPS does this:

- There is no actual MOV instruction. It's actually replaced by "OR $rd, $rx, $rx", and "ADD $rd, $zero, imm", and "LUI $rd, $zero, imm".
- This lets you do absolute addressing: "LW $rd, offset($zero)".
- NEG is of course "SUB $rd, $zero, $rx".
- NOT is "NOR $rd, $zero, $rx".
- You can store 0 to memory directly: "SW $zero, offset($ra)".
- Comparison instructions like BEQ, BEQL, BNE, BNEL don't have an immediate field (since it's used by the branch offset) but can still compare to 0, which is a very common value to compare with.
- As previously mentioned, values can be discarded by writing them to $zero (I'm not sure what's the point on MIPS though since it has an actual PREF instruction).

Generally, whether or not this is a gain tends to depend on register file size:
- SuperH and x86 have 8 registers so the 8th register is way more important than a hardwired zero.
- ARM has 16 registers and has opted not to have a zero register either (though it does have the PC mapped to a register for some reason).
- MIPS and clones such as RISCV, ALPHA, ARM64, PA-RISC have 32 registers and C++ compiler register allocators rarely fill all 32, so the very slightly reduced number of opcodes is a small but useful gain. Note that ARM64 uses the same register name for the stack pointer and zero register (which one is used depends on the opcode). One of the downsides of a zero register is that it does make software emulation more complex (since you have to check that operation destinations aren't the zero register). I also wonder if this has made out-of-order register renamers more complex or not. And considering that ForwardCom has a ton of opcodes already, I don't think removing a few ones with a zero register has that much impact anyways.

Author: csdt Date: 2016-09-26 03:43
Thank you for your hindsight.

Is it possible to use any vector instruction with an immediate scalar broadcasted?
In my mind, that was a use case for the null register.

Discarding the result is not so interesting with ForwardCom as there is no instruction with 2 outputs.
Otherwise, It could have been interesting for the division (discards the remaining).

Author: Agner Date: 2016-09-27 01:21
csdt wrote:

Is it possible to use any vector instruction with an immediate scalar broadcasted?

Yes. The immediate value can have different sizes, integer or float or a small signed integer converted to float.

Post by **agner** » 2017-11-02, 13:32:41

Author: Kurt Baumgardner Date: 2016-09-26 22:00
Ok, here's an idea I haven't heard of: Set up some registers: ProgramID, ThreadID, and ContextID. These are controlled by the CPU and set up by the OS when spawnning a program or thread. The app programmer only has access to ContextID. The bits of each of these combine to address a memory buffer for the CPU registers. A thread or program task switch simply indexes a different set of registers - no need to preserve registers between task switches. You could even include some instructions for register exchange (EXX) that simply change ContextID. Maybe this mechanism could also supply register renaming functionality...?

The actual size of the registers is implementation specific, it just depends on how the address bits are devied up. This supports the variable vector registers. Of course this would have to be a very fast memory. The idea came from the Page 0 stuff you could do on the good ol' 6502, where the first page of memory could be addressed with 1 byte.

If the actual registers were separate, and the register page could be copied from/to the real registers, you could use ContextID to fill all registers at once,, for very fast function initialization. The compiler could place instructions at program start that would fill up register contexts for critical functions, and call up that register set when entering each function. There's a lot of possibilities. It all depends on the ability to read from/write to this register page very quickly, and the ability to put enough of this memory on the chip. The amount of memory would limit the number of concurrent programs, threads and contexts, as well as the number and size of registers. Maybe you'd have to consider a program and a thread the same thing, which could reduce the size of this memory somewhat.

If done right, I think this could really speed up task switches and function initialization and save stack space. Another idea could be a partial register exchange. ProgramID and ThreadID (or just ThreadID) point you to the proper register bank, but a special command could cause, say R0 thru R15 to always read from ContextID 0, but R16 thru R31 would read from the Context set by this special bank switch command. For a 2-bit ContextID, that provides you another 64 registers to use, 1 bank of 16 at a time. So R0 thru R15 stay the same, but R16 thru R31 are pulled from 1 of 4 banks. Or you do the full context switch, for 32 out 128.

I'm not sure how this would affect other processes, but on the surface, it sounds powerful.

Please note that I am describing 2 opposing possible setups:
Setup #1: The registers live in this memory block, and are one and the same.
Setup #2: The registers are read from this memory block with a GetContext command, and written to the memory block with a PutContext command.

I see benefit in both setups. Unfortunately, I think either setup might have performance problems, unless some very specialized burst read/write hardware is built. But it could be a joy to program. Imagine placing all your immediates and constants into a register context once at the start of your program. Then, when you call your function, a single command sets up all your constants, your loop counts, and your buffer pointers, without using the stack, or even setting registers. Compilers would have to be a bit different, but we're trying for a better design, right?
Could be interesting. You might even be able to justify reducing your number of registers, and depend on bank swapping, which would reduce your instruction set considerably.

Maybe a routine could "push it's context" to a function to achieve by-reference registers. Or copy it's context into another for by-value, without using the stack, or explicitly setting registers.

Is it a crazy idea? It is complex, but could be very powerful. I wonder if it is practical.

Ok, here's my last idea: If the above cannot be done, how about an instruction that pushes registers onto the stack, based on a mask. The push-by-mask opcode is followed by an immediate 32-bit value, with 1 bit dedicated to each register. Registers are always pushed in the same order. To push r0 thru r7, you'd use an immediate value of 011111111b. Or, to push r2, you'd use 0100b. I guess this would be slow, but it would be compact, and useful. Pop would, of course reverse, so the same mask could be used.

Author: Agner Date: 2016-09-27 01:46
My proposal has some features for speeding up context switches, but not quite as radical as you are proposing.

The memory map is so small that it can be contained on the chip. There are separate memory maps for system code. In benign cases where a process uses only a small part of the total memory map area on the chip, you may swap memory maps when switching processes.

All code is addressed relative to the instruction pointer and all read/write memory is addressed relative to the data pointer and the stack pointer. This means that everything is addressed relative to just three registers which can quickly be changed. There are separate registers that are only accessible to system code. These can be useful in a device driver. The device driver needs access to the registers of the application program because these registers (actually, only half of them) can be used for parameter transfer to the device driver. The extra registers can be used by the device driver for temporary storage of the application's registers.

I interpret your Setup # 1 as something like register banks that can be swapped. This may be added as an optional feature on ForwardCom, but I don't know if the costs in terms of central silicon space can justify it.

Your setup # 2 is reading and writing all registers to some memory area or cache. I have avoided an instruction for "read all registers" or "write all registers" because I want to avoid complex micro-coded instructions and because these instructions are very slow on x86 (I don't know about other architectures). Maybe a special on-chip cache for the sole purpose of saving registers on context switch would be useful.

Author: Kurt Baumgardner Date: 2016-09-27 21:49
Makes sense. Yeah, the only way setup #2 would be worth it is if the registers could all be stored in a single cycle, like a latch, which would require an unconventional memory with a very wide data bus. Probably not practical. My goal has been to present new ideas, and let you smarter guys consider the possibilities and practicality. If this ability could be built, it would radically change the way functions are built, and executed, and it has support for your variable vector widths, which made it seem like a nice fit. I just imagined the possibility of having nearly zero cost functions, with all constants and immediate values already set and ready to go, without any stack usage, or register setup. It would be something like inlining your whole program. Oh well

Your current setup seems like a nice compromise.

Author: Agner Date: 2016-09-28 00:53
Kurt Baumgardner wrote:

I just imagined the possibility of having nearly zero cost functions, with all constants and immediate values already set and ready to go, without any stack usage, or register setup.

If you talk about function calls, and not context switches, then I can almost fulfill your wish. ForwardCom supports a register allocation mechanism where the compiler can see which registers are modified by e.g. a library function. It may use the same registers for temporary variables that do not have to be saved across the function call, and use different registers for variables that need to be saved across the function call. With 32 registers of each type, we will have no register spilling in the critical inner nesting levels. If the program has deeply nested function calls then the critical innermost loops are likely to be in the deeper functions while the outermost nesting levels are unlikely to matter in terms of performance. Return addresses can have their own on-chip stack separate from the data stack so that you have near zero cost function calls.

Author: Hubert Lamontagne Date: 2016-09-28 02:34
* For putting all registers in a cache-like memory block:
This would require a very large register file. For instance, the integer register file on some recent Intel cores has 168 registers. For ForwardCom's 32 registers, you can fit 5 whole contexts in there. But you have to reserve some registers for renamed register values that haven't been retired yet for out-of-order execution to operate on, so you need more than that. For running a modern OS, you'd probably need dozens and dozens of contexts, which would require something as large as a data cache. Ex: 8 bytes per register * 32 registers * 64 contexts = 16kb - that's the size of the whole L1 data cache on some modern CPUs!

More reasonable versions of this already exist though - I think x86 already does this for HyperThreading, with 2 different contexts at the same time. Another case where it's really useful to have just 2 bankswitched sets of registers is context switching between the user program and the OS - for handling page fault exceptions and memory allocations and software TLBs and software emulated instructions and so forth. Often, this involves only bankswitching only a few registers, such as user SP versus system SP (some ARM systems).

* For saving/restoring registers in a cache-like memory block with separate instructions:
It's going to be hard to save/restore all the active registers of a context at the same time because your physical registers will be assigned to an unpredictable subset of registers. Which means that you'd need to have 32 register read/write ports to read/write all your registers the same time. Afaik, register files with that many ports are large and power hungry. And presumably you'd need a special multiplexing mechanism to share those ports with the read/write accesses of normal operation.

You also still has the problem that your cache-like memory block has a finite size, which means that sooner or later the OS will run out and will start to read/write parts of it to main ram anyways - which is what this mechanism was trying to prevent in first place!

In theory you could map this special memory to be backed by RAM which solves size limits, but then all operations that read or write to your special memory become potential RAM operations and have to be issued potentially in-order with other memory operations and compete for data cache access ports, and can potentially trigger interrupts which means that either the CPU has to run those operations in order (with stalls when waiting), or all those operations have to run speculatively and be undoable (which means that the previous state has to be preserved until your operation can be confirmed as having run for real). Even without RAM mapping, you'd probably still need a write buffer.

* For having load multiple/store multiple instructions
Instructions that push/pull a whole bunch of registers on stack at the same time do exist. ARM has one. 68000 has one. It's nice for improving code density. The problem is that it's still an instruction that generates essentially up to 32 memory writes/reads at the same time (plus a stack pointer update). Data caches can't really handle that - they can handle at best 2 reads + 1 write per cycle. The register file can't handle that either. I'm not sure it can run faster than just a simple normal series of read/write operations, plus it would probably have to be implemented as microcode.

Author: Kurt Baumgardner Date: 2016-10-03 21:34
Thank you for the detailed answers Agner and Hubert.

Hubert Lamontagne wrote:

* For putting all registers in a cache-like memory block:
This would require a very large register file....

* For saving/restoring registers in a cache-like memory block with separate instructions:
...(more issues)...

Well, that's a shame. I've been looking at some 80x86 disassembly of a C .exe recently, that contained a lot of function calls. Most of them do a bunch of PUSHes, read some stack vars, followed by the function's actual purpose, and end with the POPs, and finally the return. And, of course that's in the majority of function calls: this constant pattern of register save, fill, use, reload, followed by return (another stack pop). Seems like there should be a better way. Maybe instead of a set for every program/thread, you could have just enough space for a few levels deep of registers. This would correspond with the "call stack". Since most calls come with PUSHes, and most returns are proceeded by POPs, it seems like a prime candidate for optimization, for both speed and code size.

Maybe that's still too complex or slow for hardware, though.

Agner wrote:

If you talk about function calls, and not context switches, then I can almost fulfill your wish. ForwardCom supports a register allocation mechanism where the compiler can see which registers are modified by e.g. a library function. It may use the same registers for temporary variables that do not have to be saved across the function call, and use different registers for variables that need to be saved across the function call. With 32 registers of each type, we will have no register spilling in the critical inner nesting levels. If the program has deeply nested function calls then the critical innermost loops are likely to be in the deeper functions while the outermost nesting levels are unlikely to matter in terms of performance. Return addresses can have their own on-chip stack separate from the data stack so that you have near zero cost function calls.

Yes, I can see how this *could* reduce most of that PUSH/POP stuff in function calls. But, can the compiler be smart enough to follow the code flow through those nested calls? (I apologize in advance if I've missed something about ForwardCom that simplifies this task). Most compilers I've experienced simply preserve a function's used registers, in the function - in each function, because the compiler does not anticipate the code flow. And, of course, in many cases, it simply cannot: CALL r15, or even CALL (r15) for example.

Author: Agner Date: 2016-10-03 23:54
Kurt Baumgardner wrote:

Yes, I can see how this *could* reduce most of that PUSH/POP stuff in function calls. But, can the compiler be smart enough to follow the code flow through those nested calls? (I apologize in advance if I've missed something about ForwardCom that simplifies this task). Most compilers I've experienced simply preserve a function's used registers, in the function - in each function, because the compiler does not anticipate the code flow. And, of course, in many cases, it simply cannot: CALL r15, or even CALL (r15) for example.

The compiler has to obey the function calling convention if the function is publicly visible. The calling convention for 32-bit x86 generates a lot of push and pop. The most efficient calling convention is in 64-bit x86 on Linux. Gnu, Clang and Intel C++ compilers can be quite efficient when the relevant optimization options are on.
See my calling convention manual at www.agner.org/optimize/#manuals

Author: Hubert Lamontagne Date: 2016-10-04 10:57
Kurt Baumgardner wrote:

[...]
Well, that's a shame. I've been looking at some 80x86 disassembly of a C .exe recently, that contained a lot of function calls. Most of them do a bunch of PUSHes, read some stack vars, followed by the function's actual purpose, and end with the POPs, and finally the return. And, of course that's in the majority of function calls: this constant pattern of register save, fill, use, reload, followed by return (another stack pop). Seems like there should be a better way. Maybe instead of a set for every program/thread, you could have just enough space for a few levels deep of registers. This would correspond with the "call stack". Since most calls come with PUSHes, and most returns are proceeded by POPs, it seems like a prime candidate for optimization, for both speed and code size.

Maybe that's still too complex or slow for hardware, though.

Function calls and their associated push/pops are very common, but a large majority of them happen on cold paths:
- Anything that happens on a cold path rarely makes any noticeable speed difference... Even moderately warm paths tend to count a lot less than the hottest paths in overall speed.
- The hottest paths often are loops and tend to either not have any function calls, or have calls that are only occasionally called (error conditions etc), or have calls that can be inlined - which means the compiler can automatically allocate registers and even reorder operations within the parent function, as long as the function is small enough for this to be beneficial.
- Hot loops that do have function calls are often very large and complex loops that are likely to stall for all sorts of other reasons, such as not fitting in the instruction cache, having data cache misses, having pointer-following chains that stall due to data cache latency, having hard-to-predict branches, loading/storing lots of values from objects and so forth, so the function call overhead might not stand out in the wash, or if you're lucky, it could happen while the CPU is waiting after something else (which makes it free).

One option would be register windows like on SPARC, which acts as a hardware assisted stack. Here's an article about that:
http://ieng9.ucsd.edu/~cs30x/sparcstack.html

Another option would be load-dual and store-dual instructions, which let you combine two 64-bit register memory operations into a single 128-bit store (as long as your stack is 128 bit-aligned) and is probably as fast possible for saving/restoring a bunch of registers at the same time (especially if you can dual-issue it to a 2-port data cache).

Post by **agner** » 2017-11-02, 13:46:06

Author: Hubert Lamontagne Date: 2016-10-28 17:38
Bonus question: How well does ForwardCom handle bilinear interpolation?

It's the one algo that I can think of that just doesn't fit well in most CPUs, wastes a bunch of cycles twiddling bits around and generating memory addresses and interpolation factors and applying them across RGBA channels (multiplied by the number of texture layers used), so that GPUs have a big advantage doing this one particular thing. It can also show up in other tasks, such as wavetable synthesis (interpolation on the time axis of the waveform + interpolation between waveforms). On most CPUs like x86 or ARM, you can use vectors to do the interpolation part but you still have to do lots of integer scalar ops to generate the memory addresses. And if the U V coordinates are generated per pixel and can't be linearly interpolated, then having to send the interpolation factors over to the vector unit and scramble them into place can easily wipe out potential gains.

Is there a way to do a particularly fast version of this on ForwardCom?

Author: Agner Date: 2016-10-29 00:48
Hubert Lamontagne wrote:

How well does ForwardCom handle bilinear interpolation?

It depends on how your data are organized into vectors/arrays. It requires a good deal of permutation. ForwardCom has good permutation instructions, but so does other vector instruction sets.

Author: Hubert Lamontagne Date: 2016-10-29 15:31
Agner wrote:

Hubert Lamontagne wrote:

How well does ForwardCom handle bilinear interpolation?
It depends on how your data are organized into vectors/arrays. It requires a good deal of permutation. ForwardCom has good permutation instructions, but so does other vector instruction sets.

Ok, how about the most classic texture organization: single texture, 256x256, 32bpp RGBA (so that's 8 bits per component, can be either RGBA or ABGR order), no swizzle. Can that be bilinear-interpolated efficiently?

Author: Agner Date: 2016-10-30 00:42
I don't think I understand your problem. You have four RGBA points in each their vector register. All of these should be multiplied by a factor and then it should all be added together. It's just multiplication and addition of 8-bit integers. You may zero-extend all to 16 bits to avoid loss of precision; then shift right and compress back to 8 bits.

Author: Hubert Lamontagne Date: 2016-10-30 21:47
Agner wrote:

I don't think I understand your problem. You have four RGBA points in each their vector register. All of these should be multiplied by a factor and then it should all be added together. It's just multiplication and addition of 8-bit integers. You may zero-extend all to 16 bits to avoid loss of precision; then shift right and compress back to 8 bits.

In pseudo-ASM it looks somewhat like this:

Code: Select all

loop:

; generate memory addresses
shr r0, u_coord, 24
shr r1, v_coord, 24
shl r1, 8
add r4, r0, r1
add r2, r0, 1
and r2, 255
add r5, r2, r1
add r3, r1, 256
and r3, 65535
add r6, r0, r3
add r7, r2, r3

; load the 4 texels
mov v0, int32 [texture_adr + r4*4]
mov v1, int32 [texture_adr + r4*5]
mov v2, int32 [texture_adr + r4*6]
mov v3, int32 [texture_adr + r4*7]

; expand components from 8bits to 16bits to avoid wrap-around problems
expand.u8.u16 v0
expand.u8.u16 v1
expand.u8.u16 v2
expand.u8.u16 v3

; generate interpolation factors
shr r8, u_coord, 16
and r8, 255
mov broadcast.u16x4 v4, r8
shr r9, v_coord, 16
and r9, 255
mov broadcast.u16x4 v5, r9

; linear interpolate
sub.s16 v1, v0
sub.s16 v3, v2
mul.s16 v1, v4
mul.s16 v3, v4
shr.s16 v1, 8
shr.s16 v3, 8
add.s16 v1, v0
add.s16 v3, v2
sub.s16 v3, v1
mul.s16 v3, v5
shr.s16 v3, 8
add.s16 v3, v1

; alpha blend
mov v6, u32 [screen_ptr]
expand.u8.u16 v6
mov broadcast.s16x4 v7, v3[3]
sub.s16 v3, v6
mul.s16 v3, v7
shr.s16 v3, 8
add.s16 v3, v6
shrink.u16.u8 v6
mov v6, u32 [screen_ptr]

; increment and loop
add screen_ptr, 4
add u_coord, u_delta
add v_coord, v_delta
cmp r10, screen_ptr, span_end_ptr
jz r10 loop

So the problem I'm seeing is not as much the linear interpolation as having to do ~50 instructions per pixel to do your addressing, interpolation, alpha blending and so forth. I guess you could compute multiple pixels together to soften the blow a bit - which I guess would be good for the interpolation and alpha blending parts, although it doesn't help too much for the addressing part.

Post by **agner** » 2017-11-02, 13:48:27

Author: Agner Date: 2016-12-08 03:44
Version 1.04 update today:

The instruction formats are made more consistent. Template E2 is modified.
The masking principle has been changed. Now there is an option to specify a fallback value to use when a mask bit is zero. The fallback value can be zero, or an input operand, or any other value specified by a register. Register r0 and v0 are allowed as masks to facilitate the use of boolean functions that have return values in r0 or v0.
The compare instruction has additional features with comparison of absolute values, and optional extra boolean operands.
Conditional jumps are modified.
Several other instructions are modified.

Author: Matthias Bentrup Date: 2016-12-12 09:42
As the compare_jump_* instructions cover basically all conditional control-flow cases, I guess that the compare instruction will be used mostly to compute mask registers. So I wonder if it would be more economical to restrict RD and RS to 3 bits and so recover 2x2 bits to store the condition code in the instruction itself.

Reading the condition code from a register may have some novel use cases, but I think that in 99% of all code (especially in compiled code) the condition code is fixed.

Author: Agner Date: 2016-12-12 11:46
Matthias Bentrup wrote:

I wonder if it would be more economical to restrict RD and RS to 3 bits and so recover 2x2 bits to store the condition code in the instruction itself.

You are right that it would be more compact, but it would break the uniform template system. This would make the hardware more complex and it would also generate an extra complication for the compiler. The present specification allows you to make a compare with fixed condition in a single-word instruction if the destination register is the same as one of the source registers.

Author: Matthias Bentrup Date: 2016-12-14 03:44
Agner wrote:

[...] The present specification allows you to make a compare with fixed condition in a single-word instruction if the destination register is the same as one of the source registers.

But unless I am mistaken you need two words if you want to compare a register to an immediate value. There a trade-offs in both directions and I don't think that either way is clearly better than the other one.

Post by **agner** » 2017-11-02, 13:51:06

Author: Joe Duarte Date: 2016-12-14 06:13
Hi Agner. I've been thinking about some interesting research I've read recently, and it raises a couple of questions for me:

1. Does ForwardCom assume synchronous system calls? Could it support a flexible async syscall architecture? I'm not sure where the boundary between an IAS and syscall architecture should lie, but it seems like it could lie wherever you want it to lie – you can choose to fold syscall specifications into an ISA, or to bracket it off as an OS concern.

I ask because this work on async sys calls is very compelling: https://www.usenix.org/legacy/events/os ... Soares.pdf

2. What do you think of the family of horizontal packing operations that the Parabix team pines for? http://parabix.costar.sfu.ca/wiki/ParabixTransform

(See their section headed "Ideal Implementation Using SIMD Packing" on the above-linked page.)

Would it be wise for ForwardCom to offer such instructions? Why or why not?

Finally, it would be nice to have a detailed comparison of ForwardCom and RISC-V, much of it in the form of comparison tables that get into specific features and decisions. I volunteer to help assemble such a comparison document if you're too busy.

Author: Agner Date: 2016-12-15 00:15
Thank you for your interesting ideas.

Joe Duarte wrote:

1. Does ForwardCom assume synchronous system calls? Could it support a flexible async syscall architecture? I'm not sure where the boundary between an IAS and syscall architecture should lie, but it seems like it could lie wherever you want it to lie – you can choose to fold syscall specifications into an ISA, or to bracket it off as an OS concern.

I ask because this work on async sys calls is very compelling: https://www.usenix.org/legacy/events/os ... Soares.pdf

System calls in ForwardCom are synchronous because they change access rights and memory map. The switch time will be low because both memory maps can be contained inside the CPU. The "FlexSC" batch processing of system calls proposed in the article you are linking to can be implemented in ForwardCom in the following way: First, you make one normal system call which will share a memory area between application and system. After that, you can use this shared memory area to communicate between the application thread and the system thread. This requies that the application thread and the system thread are running in each their core and that these two cores have easy access to the same cache. There is still a problem of synchronization between the two threads. This would probably require speculative synchronization in order to be more efficient than standard system calls. The system thread might be idle most of the time, and this is a waste of resources.

What do you think of the family of horizontal packing operations that the Parabix team pines for? parabix.costar.sfu.ca/wiki/ParabixTransform

(See their section headed "Ideal Implementation Using SIMD Packing" on the above-linked page.)

Would it be wise for ForwardCom to offer such instructions?

That sounds like a good idea. We may implement an instruction that shuffles the bits in each 64-bit block (b0,b1,b2,...,b63) of a vector register into the order (b0,b8,b16,...,b56,b1,b9,b17,...,b57,b2,b10,b18,...,b58,...,...,b7,b15,b23,...,b63). This can be followed by a permute of the bytes into the eight streams. Do you have any good use cases for the parallel bit streams?

Finally, it would be nice to have a detailed comparison of ForwardCom and RISC-V, much of it in the form of comparison tables that get into specific features and decisions. I volunteer to help assemble such a comparison document if you're too busy.

Good idea. We could make a comparison overview and put in on www.forwardcom.info or github.com/forwardcom. It may also include comparisons with x86 and ARM, but it would be too much to include all common instruction sets. I can make a draft and let you comment on it and make additions.

Post by **agner** » 2017-11-02, 13:53:59

Author: Agner Date: 2016-12-17 03:39
Joe Duarte wrote:

it would be nice to have a detailed comparison of ForwardCom and RISC-V, much of it in the form of comparison tables that get into specific features and decisions. I volunteer to help assemble such a comparison document if you're too busy.

Now I have made a comparison of instruction sets at http://www.forwardcom.info/comparison.html

Please correct any errors and important omissions.

Author: Joe Duarte Date: 2016-12-28 21:01
Agner wrote:

Joe Duarte wrote:

it would be nice to have a detailed comparison of ForwardCom and RISC-V, much of it in the form of comparison tables that get into specific features and decisions. I volunteer to help assemble such a comparison document if you're too busy.
Now I have made a comparison of instruction sets at www.forwardcom.info/comparison.html

Please correct any errors and important omissions.

Looks good. Notes:

1. RISC V should be RISC-V.

2. Why 32 registers? I'm distrustful of register counts like 32, 16, etc. – the Suspiciously Convenient Numerals of Computing. I understand why register width and data types need to be 32 or 64 bits, but we shouldn't expect the optimal number of registers to be 32 any more than 31, 29, or 43. The optimal number should be empirically determined.

3. What about virtualization support? Does ForwardCom need it? I'm thinking of stuff like the vmx instructions. And chipset level stuff like VT-d and SR-IOV, which may or may not be out-of-scope for ForwardCom. The word "virtualization" does not appear in the ForwardCom manual, though perhaps ForwardCom's architecture obviates the need for some of the explicit provisions for virtualization in x86.

Author: Agner Date: 2016-12-29 00:56
Joe Duarte wrote:

Why 32 registers? I'm distrustful of register counts like 32, 16, etc. – the Suspiciously Convenient Numerals of Computing. I understand why register width and data types need to be 32 or 64 bits, but we shouldn't expect the optimal number of registers to be 32 any more than 31, 29, or 43. The optimal number should be empirically determined.

The number of registers is a power of 2 because n bits in the instruction code gives 2n possible combinations. If we had 43 registers then we would need 6 bits in each register field and we would waste 21 unused combinations. We don't want unused combinations because code density is important. Some combinations could have special meanings at the cost of a more complicated hardware. ForwardCom has a few special combinations. 31 means the stack pointer in some contexts, and "no register" in other contexts. 29 means data section pointer and 30 means instruction pointer in some addressing modes.

It is good to have many registers because ForwardCom has a mechanism for register allocation across modules so that register spilling can be minimized. This is the reason why we have 32 registers. 64 registers would require three more bits (one for each of the three register fields in the template) so that the instruction wouldn't fit into a 32-bit code word.

What about virtualization support? Does ForwardCom need it?

Good point. I have thought about support for multiple instruction sets. See chapter 10 in the forthcoming version 1.05. A microprocessor with support for multiple instruction sets would ease the transition to a new instruction set. Virtualization would be useful here to produce one virtual machine for each instruction set, at the cost of a little extra hardware.

Author: Hubert Lamontagne Date: 2016-12-30 19:51
Joe Duarte wrote:

2. Why 32 registers? I'm distrustful of register counts like 32, 16, etc. – the Suspiciously Convenient Numerals of Computing. I understand why register width and data types need to be 32 or 64 bits, but we shouldn't expect the optimal number of registers to be 32 any more than 31, 29, or 43. The optimal number should be empirically determined.

Register IDs have to be decoded fast... it's one of the things that determines the number of stages you need between the instruction cache stage and the register rename stage (fewer stages = less cycles penalty on branch misprediction). The fastest way to encode register IDs in instructions is simply to have N bits that are always at the same position in the instruction, which is how you end up with 8, 16 or 32 etc logical registers.

The number of physical registers after rename can vary a lot more though - for instance, Sandy Bridge has 160 integer and 144 floating point physical registers after renaming, and this is a carefully optimized number so that the register file size has the right balance of propagation delay through the chip vs instruction reordering capacity.

Author: Hubert Lamontagne Date: 2017-01-05 16:47
RISC-V is starting to reach silicon. Performance is looking pretty good, for a microcontroller. (comparable to the ARM in a Teensy)

Post by **agner** » 2017-11-02, 13:54:51

Author: Agner Date: 2017-01-22 04:39
Changes in version 1.05:

Systematic description of all instructions.
Added chapter: Support for multiple instruction sets.
Added chapter: Software optimization guidelines.
Bit manipulation instructions improved.
Shift instructions can multiply float by power of 2.
Integer division with different rounding modes.

Author: Jiří Moravec Date: 2017-01-23 11:58
Hi. In table 3.16 on page 24 is probably error. In D template is only 3bit length field OP1, so OP can't be 0-15, but just only 0-7 in "1.5 D".
Same problem is in table 3.17. Unconditional jump/call has OPJ=0-7 and 8-15 in 3bit field. ???

And another question: How do you discriminate between jump instructions in format 1.5C and 1.5D format?
IL and Mode fields are same. And if whole 3bit OP1 field in 1.5D format was used, there are no other bits for different encoding 1.5C!

Author: Agner Date: 2017-01-24 00:36
Jiří Moravec wrote:

Unconditional jump/call has OPJ=0-7 and 8-15 in 3bit field. ???

The lower 3 bits of the 6-bit OP1 field are occupied by part of the 24-bit address in the 1.5 D format. The 24-bit jump has part of the address in the lower 3 bits of OP1 and 000 in the upper 3 bits. The 6-bit OP1 field will then be 0-7. Likewise, the 24-bit call has part of the address in the lower 3 bits of OP1 and 001 in the upper 3 bits. The 6-bit OP1 field will then be 8-15.

Other instructions cannot have format 1.5 if OP1 is less than 16. The instructions that are missing for this reason are subtract constant and jump conditionally. These instructions can be replaced by the corresponding add constant and jump conditionally by using the negative constant. So the rule is that format 1.5 D is used when OP1 < 16 and format 1.5 C is used otherwise.

I organized the puzzle in this way in order to squeeze jumps and calls with 24-bit relative address into the template system. The 24-bit offset is scaled by 4 so that we can cover an address range of +/- 32 Megabytes which is sufficient for most most programs. Unconditional jumps and calls are so frequent that it is nice to have them in single-size instructions.

Post by **agner** » 2017-11-02, 13:56:12

Author: Jonathan Brandmeyer Date: 2017-01-22 23:22
There are a few features of ARM Cortex-M (the microcontroller profile) that I miss when writing system-level programs in other architectures, including Cortex-A/R:

- Lack of a control register space. In Cortex-M, the processor's control registers are all placed in a per-core region of memory at a reserved physical address. Even the memory protection unit's control registers are saved up there. You don't have to muck around with coprocessor control instructions to manage the caches, memory protection, etc: Its all as simple as addressing into the members of a struct based at some fixed address. I believe, but cannot prove, that this makes visualization easier, since the next level down in the supervisor hierarchy can just map the mocked virtual peripheral to the reserved hardware address.

- Automatic stacking of registers. A major design change in Cortex-M versus ARM7** was that instead of having banked registers for handler mode (a system mode equivalent), the core automatically saved the call-clobbered registers to the stack of the thread mode (user mode equiv) stack pointer. It also saves the link register to a value which points into the system reserved memory space that has the meaning of "return from interrupt". Obvious advantage #1: interrupt handlers can be written in plain ordinary C. Slightly less obvious advantage #2 is that if the processor proceeds to execute another interrupt immediately after the first (say, because the first triggered an OS scheduling event, for example), the user mode stack pointer doesn't need to be saved. Similarly, when performing a context switch between different user-mode threads, only the OS scheduling interrupt must save the rest of the user-mode's thread to the their stack frame. Obvious disadvantage is that the ABI is partially baked into the hardware with respect to the call-clobbered registers and stack pointer.

On a SkyLake-class OoO machine, I suspect that you can execute loads to call-clobbered registers in parallel with the writes of those registers, too, as the ISR goes through the dispatch process.

- Interrupt vectoring. Each peripheral interrupt source gets its own handler instruction assigned to it. This makes interrupt dispatch to the particular driver especially easy. I think this can be taken one step further, where the handler can get an argument or few passed to it. This makes the signature of the interrupt handler something similar to that of a closure in c: "void specific_isr(void *os_provided_data, int peripheral_provided_data)" So long as the function address + arguments are less than the size of cache line, I think it shouldn't cost too much even on a big OoO machine. The interrupt controller can be 'far' away from the core it is dispatching to, passing a message the size of a cache line or less to trigger the interrupt. Now we are also baking in a few argument-passing registers, too.

My professional experience is mostly on microcontroller-class hardware, and having interrupt dispatch as fast as a function call is rather pampering. It seems from the outside that the generally poor performance of interrupt handling on larger processors has lead to a series of design attempts to avoid it, some going as far as polling (!) the device instead of taking an interrupt at the time of demand. My general assumption has been that the cost mostly comes from blowing the dcache and icache for even trivial interrupt responses just due to the branchy/pointer-chasey dispatch logic itself. Just having the peripheral interrupt controller perform the vectoring for the operating system would drastically reduce the data-cache pressure incurred by forcing the ISR to perform all of the pointer-chasing dispatch logic on its own, if that underlying assumption is true.

** ARM7 aka ARM7TDMI is actually ARMv4, while Cortex-M is ARMv7. Except for Cortex-M2x which is ARMv8. Sigh.

Author: Agner Date: 2017-01-23 00:05
Jonathan Brandmeyer wrote:

There are a few features of ARM Cortex-M (the microcontroller profile) that I miss when writing system-level programs in other architectures

Thanks for the input. The system instructions of ForwardCom are not developed yet. The idea at this point is that there are two extra sets of scratch registers, one for device drivers and one for system core. A device driver can simply save the user mode registers to its scratch registers rather than to the stack in order to avoid polluting the data cache. I would prefer to save the registers one by one, as needed, to avoid complex micro-coded instructions.

Data stack and call stack are preferably separate. The call stack can stay on-chip most of the time. There may be a separate call stack for device drivers and system code. We can keep the system call stack on-chip by banning recursive function calls in device drivers. This would make it possible to make a device driver that doesn't even touch the data cache unless it needs to do a lot of data processing. The device driver has access to a limited block of the application program's data space which it is likely to have good caching. I don't know how an interrupt handler gets its memory. Maybe the application program can call a device driver to enable the interrupt handler and assign some of the application's memory space to it.

I have not decided how interrupts are dispatched, but we may have the interrupt vector table stored on-chip.

Author: Jonathan Brandmeyer Date: 2017-01-25 07:50
Agner wrote:

The idea at this point is that there are two extra sets of scratch registers, one for device drivers and one for system core.

One of the gotchas from ARM banked registers is that the FIQ register state was retained across ISR return and reentry. In effect, the core has additional named architectural registers that are almost never actually accessible at any given time, yet still consume space in the physical register file. If the IRQ and System scratch registers are instead defined to be zeroed on IRQ or syscall entry, then the underlying physical registers can be freed by the rename engine on exception return, and reallocated automatically on first use after exception entry.

Author: Agner Date: 2017-01-25 13:46
Jonathan Brandmeyer wrote:

One of the gotchas from ARM banked registers is that the FIQ register state was retained across ISR return and reentry.

We probably have to clear the registers for security reasons.

Post by **agner** » 2017-11-02, 14:03:42

Author: csdt Date: 2017-01-27 10:21
Branch predictors is currently a hard part of a CPU.
I propose to help it a bit in the same way we can help the cache: tell it in advance where we want to jump.

I think this can be done by a couple of instructions:
- chose the destination from 2 immediates based on a condition
- unconditionally set the destination of a jump from a register
- unconditionally jump to the previously set destination

This can help basic control flow if the program is not limited by the instruction bandwidth with the second instruction.
And it can also help indirect calls (virtual call through a vtable for instance).

For safety reason, this "special register" should be saved and restored during context switches.
For security reason, the "special register" should be reset after each jump whatever instruction is used for the jump, and a signal should be send if the third instruction is executed while the "special register" is not in a valid state.

What do you think of this ?

Author: Agner Date: 2017-01-27 11:45
csdt wrote:

Branch predictors is currently a hard part of a CPU.
I propose to help it a bit in the same way we can help the cache: tell it in advance where we want to jump.

The hard part of branch prediction is not to predict the target address, but to predict whether it will jump or not. The target address can be calculated at an early stage in the pipeline for direct conditional and unconditional jumps. Certain instruction formats are reserved for jump instructions so that these instructions can be identified easily. The earlier in the pipeline we can calculate the target address, the less is the penalty for not knowing the address of the next instruction, and the less is the need for large branch target buffers (BTB).

It is much harder to predict whether it will jump or not. A good predictor contains large pattern tables that consume a lot of silicon space.

We can reduce the branch misprediction penalty by having a short pipeline and perhaps by fetching and decoding both targets of a two-way branch.

Multiway branches are even harder to predict, but we cannot do without them.

Function returns are easy to predict. If we have separate data stack and call stack, then the call stack will usually be so small that it can reside on the chip.

Author: csdt Date: 2017-01-30 05:06
The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.

It is always possible to consider a conditional jump like an unconditional jump where the target is either the one we want, or the instruction just after the jump depending on the condition.

Doing this, we can use unconditional jumps even with branches and loops, with a predefined address (set with a previous instruction).
It is possible to have no-latency branches and loops (with no need to prediction), but requires an extra instruction per jump in the instruction flow.

Author: Agner Date: 2017-01-30 07:14
csdt wrote:

The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.

It is not very clear what you mean here. It is possible, of course, to put the jump target address in a register and then do an indirect jump to the register value. This doesn't solve the problem in current hardware designs because the value of the target register is not known at the time the next instruction has to be fetched into the pipeline. You would need to store the target address several clock cycles in advance and then have a mechanism for signalling to the instruction fetcher at what point it would have to fetch from the new address. The compiler then has to place the switch-target instruction at the right position in the instruction sequence. The distance from the switch-target instruction to the point of divergence has to fit the length of the pipeline, which depends on the specific hardware implementation. This may be possible, but I am not aware of any attempt to construct such a machinery.

It is always possible to consider a conditional jump like an unconditional jump where the target is either the one we want, or the instruction just after the jump depending on the condition.

Some instruction sets have an instruction for conditionally skipping the next instruction. If the instruction that you skip is not a jump instruction, then this is in effect a conditional execution of the instruction. ForwardCom can execute instructions conditionally by using a predicate or mask. This avoids misprediction, but it doesn't save time because the time it takes to not-execute an instruction is the same as the time it takes to execute it. The reason for this is that the value of the predicate register is not known at the time the instruction is scheduled. It might be possible to construct a scheduler that checks whether the value of the predicate register is known sufficiently early to skip the instruction completely and not schedule it for execution. This would require a quite complicated hardware design, and I don't think it has ever been done.

If you have a system that conditionally skips one instruction, and the instruction you skip is a jump instruction, then you have a situation that is equivalent to a conditional jump. The instruction fetcher will not know in advance which instruction comes next.

This all boils down to the issue of pipelining. Current high-end microprocessors have a pipeline of typically 10-20 stages. This means that it has to know 10-20 clock cycles in advance which instruction to fecth next. If this information is not actually known 10-20 clock cycles in advance then you cannot avoid a waste of time if you make the wrong guess, no matter how you implement the branching. Any mechanism that makes this information known to the instruction fetcher sufficiently early would be quite complicated to implement both in the hardware and in the compiler.

Author: csdt Date: 2017-01-30 09:53
Agner wrote:

csdt wrote:

The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.
It is not very clear what you mean here. It is possible, of course, to put the jump target address in a register and then do an indirect jump to the register value. This doesn't solve the problem in current hardware designs because the value of the target register is not known at the time the next instruction has to be fetched into the pipeline. You would need to store the target address several clock cycles in advance and then have a mechanism for signalling to the instruction fetcher at what point it would have to fetch from the new address. The compiler then has to place the switch-target instruction at the right position in the instruction sequence. The distance from the switch-target instruction to the point of divergence has to fit the length of the pipeline, which depends on the specific hardware implementation. This may be possible, but I am not aware of any attempt to construct such a machinery.

The idea is not to help the prediction, but to "override" the prediction for a single given jump instruction. So there is no failed prediction in the same way as you cannot "mispredict" an unconditional jump to a fixed target.
Let me give you a few examples (pseudo-asm syntax).

The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target

Here is a simple loop:

(prolog)
loop:
%r01 = sub.i32 %r01 1
set_jmp %r01 loop_end loop
(loop body)
jmp2target
loop_end:
(epilog)

So basically, the jump is always taken, and you know exactly the address of the jump in advance.
Of course, you need to know when load the instructions of the target address, but this can be handled by the branch predictor like a regular unconditional jump to a fixed address.

The same applies to if conditions:

%01 = (compute the condition)
set_jmp %r01 if_end if_then
(some unrelated computation)
jmp2target
if_then:
(conditional computation)
if_end:
(unconditional computation)

Of course, if you set the target too early, you need to wait, so you need to execute the "set_jmp" early enough, but the same problem arises with data prefetching anyway.

Author: Agner Date: 2017-01-31 01:40
csdt wrote:

The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target

Here is a simple loop:

(prolog)
loop:
%r01 = sub.i32 %r01 1
set_jmp %r01 loop_end loop
(loop body)
jmp2target
loop_end:
(epilog)

So basically, the jump is always taken, and you know exactly the address of the jump in advance.

OK. Now I think I understand you.

Your idea is somewhat like an extension to a "delay slot". Some old RISC architectures (MIPS, SPARC, OpenRisc) have a delay slot of one or two instructions after a jump. The instructions in the delay slot are executed before the jump is executed. The main problem with these delay slots is that they are optimized for a particular hardware implementation. A new implementation with a longer pipeline and more parallelism will need a longer delay slot.

This makes me wonder if it would be useful to have a variable-length delay slot. A branch instruction could have an 8-bit field indicating a delay slot of 0-255 instruction words. The compiler will place the delayed branch instruction as early in the code as it can. The instruction fetcher in the CPU will then know in advance what to fetch after the branching point. If the fetcher has already fetched past the branching point at the time the delayed branch is executed then it has to discard the extra instructions and fetch again. This will give a bubble in the pipeline, but this bubble will be smaller the longer the delay slot is.

There are known problems with delay slots: Interrupts, debug breakpoints, and jumps in the delay slot lead to extra complications.

We might overcome these problems by making the delayed jump instruction a "prediction hint" to be confirmed later by a regular branch instruction at the actual branching point. Only in the rare case that the prediction hint is lost or compromized due to interrupts or whatever will we have a misprediction at the branch point.

The delay slot has to be quite long. Modern microprocessors can execute 3 or 4 instructions per clock cycle and have a pipeline of 10-20 stages. With 3 instructions per clock on average and a 10-stage pipeline we would need 30 instructions to fill the delay slot. If no other jump instructions are allowed in the delay slot then we would rarely be able to make a sufficiently long delay slot.

Now, back to your proposal. I understand that you want to have a separate register for jump targets only. We would have to set this jump target register some 30 instructions in advance in order to keep the pipeline full. I think it is unlikely that we have no other branch instructions in this 30 instruction space, so we need more than one jump target register. If we need many jump target registers, then we might as well use the general purpose registers for jump targets instead. So, now your set-jump-target instruction just becomes ordinary instructions that calculate a jump target and put it into a register. And your jump-to-target instruction becomes a register-indirect jump. Now the problem becomes how to make the register value available to the branch prediction front end of a superscalar processor. The instructions that set the jump target are executed in the out-of-order back end, while the branch predictor operates in the in-order front end.

This makes me think, is it possible to execute simple instructions already in the in-order front end of a superscalar processor? If we can execute the instructions that calculate the jump target already in the in-order front end, then we have reduced the necessary delay to a few clock cycles. I am imagining a dual execution system: We have a system of in-order ALUs and out-of-order ALUs. Simple instructions are executed in the in-order ALUs if the operands are available already while the instruction is decoded in the in-order front ind. If the operands are not available yet, then the instruction is passed on to the out-of-order back end.

Here, we need a system to track all the general purpose registers to see if they are ready. We can have a "register cache" in the in-order front end. Every time the decoder sees an instruction that writes to a register, it will mark the entry for this register as dirty and remember which instruction it was. When this instruction is later executed (in order or out of order) the value is written to the register cache and marked as clean (unless another instruction writing to the same register has been encountered in the meantime). Simple instructions can execute in the in-order system if all their input operands are ready. The result is stored both in the register cache and passed on to the register renamer/allocater.

Branch instructions, jumps, calls, etc. should execute in the in-order front end. The compiler should make sure that any registers that the branch depends on are calculated as early as possible, and preferably using simple instructions that can execute in the in-order front end. If a branch depends on a memory operand, such as a jump table or table of function pointers, then the compiler should load the memory operand into a register as early as possible and then make a register-indirect jump.

A simple loop with a loop counter could benefit from this. The increment and compare instructions for the loop counter will be simple instructions that are likely to execute in the in-order front end so that the values are available early.

Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.

This is an interesting discussion. The software may be able to calculate branch targets early in many cases, but the intricacies of out-of-order execution makes it complicated to forward this information to the in-order front end. Whatever solution we may come up with might be complicated, but still use less silicon space then the complicated branch prediction mechanisms that are used in modern microprocessors today.

Author: csdt Date: 2017-01-31 04:31
We now understand each other

The solution I propose shares some features with a delay slot, but solves the main issue: the delay slot is implementation-dependent.
In my solution, if you set the target too late, you will just wait and create a bubble in the pipeline, but the CPU will execute what you expect.

Agner wrote:

There are known problems with delay slots: Interrupts, debug breakpoints, and jumps in the delay slot lead to extra complications.

For interruptions, the target must be stored and loaded back at the end of the interruption. (You will probably lose the benefit of setting target early, but no undefined behavior here)
For jumps in the "delay slot" (between the target set and the actual jump), my first idea was to forbid this, and when this happens, trap when trying to jump with "jmp2target".
The way to solve this was to have a flag in the jump register indicating an incorrect value. If we try to jump with this flag -> trap.
This flag is set to 1 by default and reset to 1 every time a jump happens. This flag is then set to 0 when we define a target with "set_jmp".
Of course, these are just ideas and are adjustable.

Agner wrote:

Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.

That was exactly that kind of code that makes me start this reflection on the subject (with vtables that are basically a jump table).

Now, I should mention that I'm not a hardware guy, but I had advanced Computer Architecture courses. (Basically, I know how it behaves, but I don't know how to implement it).

I think it would be better to have separate registers, that can live in the front-end world, and the "set_jmp" instruction is the interface between back-end and front-end worlds.
And in order to know if there is "set_jmp" being executed when the front-end has to fetch the "after-jump" instruction, we can have a flag in the special registers set by the front-end and telling an update of this register is pending.
In this case, you have to wait the register update by the back-end. In order to speed-up the thing further, the back-end could try to execute this instruction in priority.

Maybe you could add regular branch prediction when you have to wait. You will still benefit from the knowledge of the jump target by avoiding to flush the whole pipeline (only a part of it), but this could be much more complicated (I have no clue on this point).
I think it depends if you take this approach as the only way to branch, or if it is a complementary mechanism to branch.

I wonder about having multiple targets at the same time. It would be very handy for nested loops and nested control-flow in general, but some security issues arise.
For instance, with only one jump target, it is easy to detect a jump between "set_jmp" and "jmp2target" (see the beginning of this post). But with several of them, it becomes hard to detect without removing the benefit of having multiple jump targets.

I hope my explanations are understandable enough.

Author: Hubert Lamontagne Date: 2017-02-01 01:54
Agner wrote:

[...]
This makes me think, is it possible to execute simple instructions already in the in-order front end of a superscalar processor? If we can execute the instructions that calculate the jump target already in the in-order front end, then we have reduced the necessary delay to a few clock cycles. I am imagining a dual execution system: We have a system of in-order ALUs and out-of-order ALUs. Simple instructions are executed in the in-order ALUs if the operands are available already while the instruction is decoded in the in-order front ind.

That is sounding a lot like a Decoupled Access Execute architecture. Some documents about this:
https://cseweb.ucsd.edu/classes/wi09/cs ... oupled.pdf
http://personals.ac.upc.edu/jmanel/pape ... der CPUs)
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
http://spirit.cs.ucdavis.edu/pubs/conf/dynamicarch.pdf

One general problem with that kind of architecture is that it's hard to keep them simple, and they tend to end up with designs that are more complex than the simpler OOO cpus (in which case it's hard to argue against going with the OOO design).

If the operands are not available yet, then the instruction is passed on to the out-of-order back end.

Here, we need a system to track all the general purpose registers to see if they are ready.

A lot of simpler Out-Of-Order CPUs do something like this for memory address operands... For instance, the classic Athlon initiates memory loads and stores in-order. Due to this, instructions that generate memory addresses and don't depend on loads/stores will generally run quite earlier in the execution stream, and the readiness of the operands gets tracked by the scheduler. Maybe you could add a priority flag to instructions that affect control flow so that the scheduler will run them early (and possibly instructions that affect load/store addresses too).

We can have a "register cache" in the in-order front end. Every time the decoder sees an instruction that writes to a register, it will mark the entry for this register as dirty and remember which instruction it was. When this instruction is later executed (in order or out of order) the value is written to the register cache and marked as clean (unless another instruction writing to the same register has been encountered in the meantime). Simple instructions can execute in the in-order system if all their input operands are ready. The result is stored both in the register cache and passed on to the register renamer/allocater.

Branch instructions, jumps, calls, etc. should execute in the in-order front end. The compiler should make sure that any registers that the branch depends on are calculated as early as possible, and preferably using simple instructions that can execute in the in-order front end. If a branch depends on a memory operand, such as a jump table or table of function pointers, then the compiler should load the memory operand into a register as early as possible and then make a register-indirect jump.

A simple loop with a loop counter could benefit from this. The increment and compare instructions for the loop counter will be simple instructions that are likely to execute in the in-order front end so that the values are available early.

Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.

This is an interesting discussion. The software may be able to calculate branch targets early in many cases, but the intricacies of out-of-order execution makes it complicated to forward this information to the in-order front end. Whatever solution we may come up with might be complicated, but still use less silicon space then the complicated branch prediction mechanisms that are used in modern microprocessors today.

Afaik, the case that really kills is this: Somewhat predictable branches that depend on memory operands (especially rarely used memory values that are loaded late and come in from L2 or later).

In C++ code this shows up as checks on rarely used or rarely changed flags to trigger rare special cases - for instance, logging errors on exceptional cases. This leaves yo with jumps that you're 99.99% sure will never be taken but will still take dozens of cycles to retire. I'm not sure I've seen any other solution to this than speculation and branch prediction (aside from the Transmeta Crusoe's strange backup-restore of the whole CPU state and Itanium's "insert omniscient compiler here" scheme)... and this mess has a kinda massive cost as far as I can tell (need to backup the result of every single instruction going through the pipeline, need large register file, need aggressive register renaming due to having so many renames per cycle).

Author: Agner Date: 2017-02-01 03:12
Hubert Lamontagne wrote:

That is sounding a lot like a Decoupled Access Execute architecture.

Thanks a lot for the links. Yes, this resembles what I am thinking about. The papers you are referring to are from the 1990's. I found a few newer articles on the topic, but no real progress:
http://www.jarnau.site.ac.upc.edu/isca12.pdf
http://www.diva-portal.org/smash/get/di ... FULLTEXT02

These articles still don't solve the problem of how to split the instruction stream into a control stream and a compute stream. This makes me think of the following idea. Let the compiler insert an instruction telling which registers are part of the control stream. For example, if register r1 is used as a loop counter, and r2 is controlling a branch, then tell the microprocessor that r1 and r2 should be privileged. Any instruction that has r1 or r2 as destination will go into the control stream. These instructions, as well as all branches, jumps and calls will be executed in the in-order front end as far as possible.

The front end should have access to the permanent register file. The back end may have out-of-order execution with register renaming. The renamed registers will be written to the permanent register file when they retire. We have to keep track of whether the permanent registers are up to date. When the decoder sees an instruction that writes to a register, it will mark this register as invalid in the permanent register file and remember which instruction it belongs to. When this instruction is executed (whether in the front end or the back end), the register entry is marked as valid, unless another instruction writing to the same register has been seen in the meantime.

The main problem we are trying to solve with this design is that the very long pipeline of superscalar processors is bad for branch instructions. The solution with a branch target register puts the branch in the front end, but not the calculation of the branch target. We would need perhaps 30 or 50 instructions between the calculation of the branch target and the branch itself in order to keep the pipeline full. To reduce this distance, we need to also put the calculation of the branch target, and everything that the branch depends on, in the front end as well. This includes not only branch targets, but also loop counters, branch conditions, indexes into jump tables, and all preceding calculations that these depend on. A separate set of registers and instructions for all this would be too complicated. I think it is easier to use the existing registers and tell the microprocessor which registers are part of the control stream. This wouldn't be too hard to implement in a compiler, I think. The question is how efficient we can make it in the front end.

Author: Hubert Lamontagne Date: 2017-02-01 17:21
There are two different ways to select values that can be sent to back-end:
- Value is never used in any memory address calculation
- Value is never used in any branch address calculation

They are actually somewhat orthogonal:
- A CPU can have a front-end that runs in-order but has register renaming and can be rolled back, which allows conditionals on the back-end but not memory index calculations
- A CPU can have a front-end that supports reordering but not speculative execution (all writebacks must be real!), allowing back-end indexing but not conditionals

Non-indexing and non-conditional values can be marked in SSA form: starting from the bottom, any value that gets "garbage collected" (goes out of scope, runs out of destination operations) without ever going to a memory load/store address or branch/jump-address can be marked, which in turn allows marking preceding values as non-indexing and/or non-conditional. The proportion of non-indexing and non-conditional values depends on the program (the test I did ended up with about 80% non-indexing values and 60% non-conditional values but this might not be typical at all). It might be very hard to perform this kind of marking through function calls or complex control flows though.

Author: Agner Date: 2017-02-02 01:11
Hubert Lamontagne wrote:

They are actually somewhat orthogonal:
- A CPU can have a front-end that runs in-order but has register renaming and can be rolled back, which allows conditionals on the back-end but not memory index calculations
- A CPU can have a front-end that supports reordering but not speculative execution (all writebacks must be real!), allowing back-end indexing but not conditionals

My intention is clearly number 2. The purpose is to shorten the branch misprediction delay to probably 3 clock cycles: fetch, decode, execute. The front end may allow reordering in the sense that an instruction can wait for an operand without delaying subsequent independent instructions. But it should not have register renaming in the front end.

I am focusing on branching here, which is clearly a weak spot in current superscalar designs. I am not intending to use this as a way of prefetching memory operands. The out-of-order mechanism is able to handle delayed memory operands satisfactorily.

My vision is this: The compiler is indicating which registers it intends to use for control flow. For example, it may use register r1 for loop counter, r2 for a branch condition, and r3 for index into a jump table. All instructions that modify r1, r2, and r3 will execute in the front end if possible. The remaining instructions are sent to the the back end. Instructions that have any of these registers as input operand will have the register operand replaced by its value before it is sent to the back end. Other register operands that are available in the permanent register file can likewise be replaced by their values. Instructions that execute in the front end may write their result not only to the permanent register file but also to any pending instruction that needs it.

The result of this mechanism is that a loop that is controlled by the marked registers only will be unrolled in the front end. The instructions in the loop body will be passed on to the back end with the loop counter replaced by its value. The back end has register renaming so that it can execute the body of the second iteration before it has finished the body of the first iteration.

All other control flow, such as branches, jumps, calls, returns, etc., will be unrolled in the front end as well so that control flow is effectively resolved before execution as far as it does not depend on delayed data.

The demand for advanced branch prediction is somewhat relaxed if we can drastically reduce the branch misprediction delay in this way. The branch prediction machinery in modern processors is very complicated and requires a lot of silicon space for branch target buffers and pattern/history tables. We can cut down on this and use a simpler branch prediction scheme if branches are resolved in the front end.

The front end may fetch and decode speculatively, but not execute speculatively. It may even fetch both sides of a branch that it expects to have poor prediction. Fetching both sides of a branch becomes cheaper here than with a traditional superscalar design because it only needs to duplicate two pipeline stages (fetch and decode) in order to minimize branch misprediction bubbles.

The fetch and decode stages should have higher throughput than the rest of the pipeline so that it can catch up and fill the pipeline after a branch bubble. The fetch stage should be able to fetch a large block of code in one clock cycle, possibly crossing a cache line boundary. The decoder should be able to decode the first one or two instructions in a fecthed code block in the first clock cycle, and determine the starting points of the remaining instructions. This will allow it to decode maybe four or more instructions in the second clock cycle after fetching a block of code. The aim of this is to keep the branch misprediction delay as low as possible.

In case the front end ALU is idle because there are no instructions that write to marked registers, it may execute any other instructions if their operands are ready.

Author: Agner Date: 2017-02-14 13:20
This idea about decoupling the control flow from execution is now included as a proposal in the manual version 1.06 (chapter 8.1).

http://www.agner.org/optimize/forwardcom.pdf

Author: Hubert Lamontagne Date: 2017-01-31 01:55
csdt wrote:

[...]
The idea is not to help the prediction, but to "override" the prediction for a single given jump instruction. So there is no failed prediction in the same way as you cannot "mispredict" an unconditional jump to a fixed target.
Let me give you a few examples (pseudo-asm syntax).

The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target
[...]

So basically, the jump is always taken, and you know exactly the address of the jump in advance. Of course, you need to know when load the instructions of the target address, but this can be handled by the branch predictor like a regular unconditional jump to a fixed address.
[...]

You'd need to compute the conditional at minimum ~10 instructions before the jump, even in the best case on the shortest pipelines (and with a bypass path straight from the register bypass network to instruction cache address input!). For instance, with a 3-instruction-per-cycle cpu:

set_jmp %r01 loop_end loop
nop ; loads on same cycle as the jump, conditional execution impossible
nop ; loads on same cycle as the jump, conditional execution impossible
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the register renamer is mapping %r01
nop ; starts loading while the register renamer is mapping %r01
nop ; starts loading while the register renamer is mapping %r01
jmp2target ; earliest instruction that could execute conditionally in the MOST optimistic case imaginable (but real CPUs have way many more front-end latency cycles)

If the conditional value doesn't depend on memory, and it doesn't depend on results very low down the computation chain, then this could theoretically work - for instance this could probably work on loop indexing (if LLVM can hoist the conditional all the way up the SSA), especially if the branch predictor serves as a fallback.

I guess this could also sorta make sense for floating point or SIMD conditional branches (which often have scheduler penalties and slow value hoisting paths).

Also, you don't need to calculate the branch target early (that can be supplied by the branch target predictor), only the conditional value. So maybe instead of a special jump, you could simply do this by letting the branch predictor have some special port that can read flag values early, which means that it could get 100% prediction if the flag values have been "cold" for long enough when it reaches the jump.

Post by **agner** » 2017-11-02, 14:05:01

Author: fanoI Date: 2017-03-21 03:41
Agner the code to do an addition between "big integers" that is at page 85 of the forwardcom manual is applicable to any CPU that can do vector integers addition? It could work with x86 using PADDB?

This idea is applicable to all other operators (-, *, /, &...) right?

Author: Agner Date: 2017-03-21 04:16
fanoI wrote:

Agner the code to do an addition between "big integers" that is at page 85 of the forwardcom manual is applicable to any CPU that can do vector integers addition?

Yes

It could work with x86 using PADDB?

Better use VPADDQ

This idea is applicable to all other operators (-, *, /, &...) right?

Only + and -.
* and / are more complicated. & is straightforward because it has no carry.

Post by **agner** » 2017-11-02, 14:06:00

Author: Joe Duarte Date: 2017-04-13 20:12
FYI, Intel has an upcoming solution for protecting return addresses: https://software.intel.com/sites/defaul ... review.pdf

It would be interesting to compare ForwardCom's approach to CET.

Author: Agner Date: 2017-04-14 00:53
Joe Duarte wrote:

Intel has an upcoming solution for protecting return addresses: https://software.intel.com/sites/defaul ... review.pdf

It would be interesting to compare ForwardCom's approach to CET.

Thanks for the info. Intel's solution is quite complicated because it has to be compatible with legacy code and legacy processors. ForwardCom's solution is quite simple: A separate call stack which stays in the CPU most of the time, and jump tables placed in read-only memory. This just shows the difference between patching an insecure system and designing a system to be secure from the beginning.

Post by **agner** » 2017-11-02, 14:07:06

Author: Agner Date: 2017-07-27 02:06
I am working on making an assembler for ForwardCom, and I think that the design of assemblers could use an update, now that we are redesigning everything anyway. I found that it was not too hard to make the assembly syntax look very much like C or Java, rather than the terrible syntaxes of many current assemblers. This will surely make it easier to write assembly code and make the code easier to read. My plans for the assembler so far are as follows:

Instructions look like C code. For example to add two registers:

Code: Select all

int32 r1 = add(r2, r3)

Or simply:

int32 r1 = r2 + r3

Memory operands are indicated with square brackets, where the length of vector registers is indicated if necessary:

Code: Select all

float v1 = [r1 + 0x100, length = r5] // read memory operand of r5 bytes into vector register v1

Masked instructions can be coded conveniently with the ?: operator:

Code: Select all

int32 r1 = r4 ? r2 + r3 : r5 // add r2 and r3 with boolean in r4 as mask and r5 as fallback value

This is still assembly, because you make one machine instruction per line, and the programmer has to decide which registers to use, but it looks like high-level language.

Branches and loops can be coded either with conditional jumps or as high-level constructs like if, else, switch, for, while, do, break, continue, with {} for block nesting.

I also intend to support preprocessing directives like #define, #include, #if; #else, #endif.

Now, a more tricky problem is macros and metaprogramming. Most assemblers have macro features that allow metaprogramming, but the syntax is often complicated, confusing, and inconsistent. Such features are probably needed, but I would like a more logical syntax. I have looked at high-level languages to see if they have useful metaprogramming features. Metaprogramming in C++ is possible with templates, but this is very convoluted because the template feature was designed for a more narrow purpose. I would prefer something more similar to the support for self-modifying code in script languages. You make a string variable and then issue this string to the assembler and have it interpreted as assembly language.

We need a way to distinguish between code that is generated for run-time execution, and meta-code that is executed in the assembler. My idea so far is to use a dot (.) or percent (%) or some other symbol to indicate assembly-time meta code. For example, a loop that adds the numbers from 1 to 10 at runtime will look like this:

Code: Select all

int64 r0 = 0
for (int64 r1 = 1; r1 <= 10; r1++) {
int64 r0 += r1
}

A similar assemble-time loop could look like this:

Code: Select all

. sum = 0 // define assembly-time variable sum
.for (i = 0; i <= 10; i++) { // dot indicates assembly-time loop
. sum += i
}
int64 r1 = sum // runtime code r1 = 55

This will calculate the sum at assembly time and just generate a single runtime instruction setting register r1 = 55.

The assembly-time variables will be dynamically typed, using 64-bit integers, double, or string, depending on the context. All common operators are supported. String expressions can be converted to assembly code. For example:

Code: Select all

. text = " nop \n"
. emit text + text /* This will send the string " nop \n nop \n" to the assembler and produce two nop instructions */

With this proposal, we have three kinds of code. For example, "#if" will branch at the preprocessing stage, ".if" will branch at the assembly metaprogramming stage, and "if" will generate executable code that branches at runtime. It may be confusing with three types of code, but at least the "#" and "." prefixes will make the distinction clear.

The preprocessing directives (with #) are executed before the assembler interprets any code, so it has no knowledge of anything defined in the code, while the metaprogramming features (with .) can check e.g. if a function is defined before making code that calls it, or it can make different code depending on the type or size of a defined data object.

I would like to hear your comments before I implement all this.

Author: Kai Rese Date: 2017-08-11 09:13
This looks like the most convenient and easiest to use assembler I have ever seen. I'm really looking forward to playing around with it.

Are there reasons why the preprocessing stage and the metaprogramming stage can't be merged into one stage?
This would reduce the assembler to only two kinds of code, but I don't know if there would be complications.

Author: Agner Date: 2017-08-11 10:00
Kai Rese wrote:

Are there reasons why the preprocessing stage and the metaprogramming stage can't be merged into one stage?

The preprocessor has no knowledge of anything defined in the code. For example, it irritates me that you cannot do something like
#if defined(myFunction)
or
#if sizeof(myStructure) > 8
in C or C++.

The need for metaprogramming is higher in an assembler than in a compiler, and it would be nice to have metaprogramming features that can query information about defined symbols.

I think it is convenient to have preprocessing directives like #include and #if resolved before the assembler begins to interpret anything else because programmers are used to this concept and it is completely transparent what happens. If we would use some kind of metaprogramming for the same purpose then it becomes less obvious at what stage in the process these directives are executed.

I have encountered a problem with this strategy, though. Consider the following code, where % means define an assemble-time variable:

%A = R1
%B = R2
A = B

This is ambiguous. It could generate the runtime instruction R1 = R2, or it could change the assemble-time variable A to R2. A possible solution to this dilemma is to define a symbol for alias names. For example:
& parameter1 = R0
could define parameter1 as an alias for register R0. This would be useful for function parameters and local variables, but it would generate further confusion with three different prefixes: #, %, &

Any better ideas?

Author: Kai Rese Date: 2017-08-14 22:33
Agner wrote:

I think it is convenient to have preprocessing directives like #include and #if resolved before the assembler begins to interpret anything else because programmers are used to this concept and it is completely transparent what happens. If we would use some kind of metaprogramming for the same purpose then it becomes less obvious at what stage in the process these directives are executed.

This is a valid point, I still have to think about that.

My idea was to use metaprogramming as an extension to the preprocessor. For example, variables defined on meta-level could be treated as preprocessor-macros by the actual assembler. This policy also solves the ambiguity problem:
Agner wrote:

Code: Select all

    I have encountered a problem with this strategy, though. Consider the following code, where % means define an assemble-time variable:

    %A = R1
    %B = R2
    A = B
    This is ambiguous. It could generate the runtime instruction R1 = R2, or it could change the assemble-time variable A to R2.

If the symbol always gets set in the meta-code and always gets inserted in the assembler code, this example would always produce:
R1 = R2
We usually do not know the content of a register at assemble time. For that reason, we should be able to automaticly make a symbol an alias when defining with a register. Function arguments should also be able to be defined as aliases automaticly.

Agner wrote:

A possible solution to this dilemma is to define a symbol for alias names. For example:
& parameter1 = R0
could define parameter1 as an alias for register R0. This would be useful for function parameters and local variables, but it would generate further confusion with three different prefixes: #, %, &

This is a trade-off that has to be made. How many features do we need and how much complexity is still easy enough to use?
My vision was usage of the metaprogramming-features in a template-like fashion. For example, this is a template for an engine written in python, called mako. It outputs html-code:

Code: Select all

<%inherit file="base.html"/>
<%
    rows = [[v for v in range(0,10)] for row in range(0,10)]
%>
<table>
    % for row in rows:
        ${makerow(row)}
    % endfor
</table>
<%def name="makerow(row)">
    <tr>
    % for name in row:
        <td>${name}</td>\
    % endfor
    </tr>
</%def>

My suggestion is aiming for a fairly simple code, maybe at the price of a more complex interpretation. I really don't know if this would be a better way than the current way of using different prefixes. It could also cause more problems with debugging, as you mentioned. There also might be many other problems I'm currently not seeing, but it is an idea nonetheless.

Author: Agner Date: 2017-08-14 23:44
Kai Rese wrote:

If the symbol always gets set in the meta-code and always gets inserted in the assembler code, this example would always produce:
R1 = R2

The meta-code still needs access to its own variables, for example a loop counter. Perhaps we can use a percent sign to indicate that a meta-variable is set or modified.

%A = R1
%B = R2
A = B // produces R1 = R2
%A = B // modifies A to equal R2

Author: Kai Rese Date: 2017-08-15 21:04

With what we have so far, it should work like this:

Code: Select all

%A = 0 // sets A as value
%B = A + 5 // allowed if A is a value

%A = R1 // sets A as alias
%B = A + 5 // invalid as A currently is an alias

%A = R1 + 5 // could be allowed as complex symbol, could also get messy

int64 R2 = A // allowed if A is defined, could be a move or a set instruction
int64 R2 = A + 5 // allowed if A is a value, otherwise not allowed

A = R2 // allowed if A is an alias, otherwise not allowed

I used the percent sign as indicator for metacode. I prefer the percent sign over dots because dots are fairly easy to overlook.
We also could make an alias require a special or different sign. It might benefit functionality, but could lead to more confusing code.

Post by **agner** » 2017-11-02, 14:16:00

Author: Hubert Lamontagne Date: 2017-08-22 12:46
I've taken another quick look today at the specs to figure out how feasible a Verilog implementation running on a FPGA would be, and I'm finding that register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer).

So I'm wondering, how many ports can modern register files manage, how does it scale with chip size and frequency, and how many ports are realistic on a modern mid-sized FPGA? (like, a 30$ FPGA or something?)

Author: Agner Date: 2017-08-23 00:37
Hubert Lamontagne wrote:

register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer).

The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.

I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.

Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.

My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.

Why do you think register read ports are so expensive?

Author: Hubert Lamontagne Date: 2017-08-27 10:12
Agner wrote:

The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.

I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.

Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.

My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.

Why do you think register read ports are so expensive?

I think this is one of those kind of things where it really depends on which tech you're aiming... Intel and AMD have huge register files with tons of ports on their modern cores. That probably takes up tons of silicon and suck out tons of power (and is probably hard to get running at very high clock rates). But that kind of chip is your target, so it's appropriate. Come to think of it, you're reminding me of the whole segment thing, and that's making me quite curious as to how x86 cores track that - it's the kind of thing you'd possibly want to keep in a separate register file and give it its own register renamer to reduce the level of contention with the rest of the registers.

On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
Here's a paper on this.

As for RISC-V, they're obviously targeting smaller implementations (kindof like ARM), which are often in-order with 1 or 2 instructions per cycle. At that performance point (and in small out-of-order CPUs), this MIPS-inspired design is pretty much the optimal design (along with ARM-style), as far as I can tell. Though you do have a point, that they might be overdoing simplicity by not having R+R*n addressing modes - although the performance hit of that probably varies a lot per application. I think a very large proportion of memory reads in non vector code are stack pointer or object pointer + offset, which probably reduces the impact of lacking more complex addressing modes.

One thing ARM has done is to keep the number of input dependencies per uop low at 2 for the integer part of the core (these ops HAVE to run at low latency), and let vector ops have 4 inputs (you can see the cost of this if you look at result latency for NEON ops - often 3+ cycles!).

Author: Agner Date: 2017-08-28 01:35
Hubert Lamontagne wrote:

On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.

What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Author: Bigos Date: 2017-08-28 15:36
Agner wrote:

Hubert Lamontagne wrote:
On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

I think this is how most x86 architectures worked till Sandy Bridge on Intel side and Bobcat/Bulldozer on AMD side. Later they have switched to PRF-like architecture (Physical Register File) where data only moves where necessary and is not tied to the pipeline at the early stages. The reason for that was power - moving so much data (especially with 256-bit vectors introduced in Sandy Bridge) took a lot of it. There were also read ports issues on P6 architectures till Sandy Bridge that you mentioned in your microarchitecture paper which were probably fixed just because of that - meaning it was probably not scalable enough.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

Unless each vector pipeline uses the same register names for each part of the same bigger register, you would have to make these whole-register operations take N times as many sources (where N is the number of 64-bit pipelines), which would probably we an overkill for a scheduler.

Also, I don't see what would be the benefit of such a partitioning? Of course, some level of partitioning is needed (for example for power-gating unused data paths or separating some per-line units like adders), but that is physical (you could split it for each bit if you wanted); I don't think it would help the register files much - you still need as many read and write ports.

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Also, about the register file, the most problematic are the write ports. You can always double the number of read ports by duplicating the register file and having each write take place in both copies; the scaling is less than linear due to the writes being more complicated (have to drive more inputs), but is mostly fine. Minimizing the number of write ports is the most important thing.

You can also partition your register space into multiple classes, each of these having a different register file (Intel does it with the kN mask registers). This effectively resets your register file port count, unless you have to duplicate all of the ports in order to support all of the possible register data transfers at the same time. You can, for example, partition the scalar and vector registers into different files (like AMD does) in order to increase the maximum possible throughput if you use both scalar and vector instructions at the same time (Zen can handle up to 10 instructions partially thanks to that, though only for short bursts due to different bottlenecks elsewhere in the pipeline; Intel can only do 7, not counting the separate store port)

BTW, a similar partitioning scheme can be used for caches (the most obvious being I$ vs D$). For example, the Mill architecture has 2 different I$ partitions - one for instruction chunks going forward and another for the ones going backward. Each jump target is really a middle of code and each instruction is halved, with half going forward and half going backward. Since each instruction half will only ever go to a single cache, you effectively double the cache capacity without compromising the latency and complexity (just 2x area and power for having twice as much memory).

https://millcomputing.com/docs/encoding/

Author: Hubert Lamontagne Date: 2017-08-28 17:05
Agner wrote:

Hubert Lamontagne wrote:
On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

Hmm, interesting. I think I've read that some Intel P2 family cores work something like this somewhere. Though I'm a bit worried that this might make the Reservation Station rather big, since each input operand of each instruction in the Reservation Station has to watch all the write ports all the time and multiplex it in... So your reservation station entry needs a register per operand, plus a multiplexer for each operand so that the appropriate result can be saved. This is kindof moving the hugely-multiple-write-ported register file into the reservation station.

I think in that case your permanent register file also still needs about as many ports as in a design where the same register file is used both for permanent and temporary values.

It might make sense to go for a design where registers are renamed to different physical registers depending on which unit produces them. For instance, ALUs and memory load units could have a different physical register pool internally (even though they look the same to programmers) so that they don't have to share write ports, and the register renamer can relatively easily keep track of what's in which pool. Generally for FPGAs, register files with many load ports and a single write port tend to be the better plan - the Verilog tool simply creates as many register file copies as it needs.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

Quite sensible and I'd bet dollars to pennies that most SIMD units are built this way ;3

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Hmm, I wonder if it's possible to track which instructions/operands/values are "front end" and tag them in the instruction cache, so that they can be separately issued? Using some kind of heuristic like "result used in address computation" or "result not coming out of memory load/sufficiently stale by the time it gets used". Like, a front/back-endness predictor ;3

Post by **agner** » 2017-11-02, 14:20:39

Author: yeengief Date: 2017-09-20 14:03
Re variable instruction lengths: Consider some forced alignment, e.g. every 16-byte (or every 32-byte) boundary MUST begin a new instruction. This has minimal impact on the effective code density (if your compiler is not stupid), while allowing almost random access decoding of instructions. This is good for the processor (short dependency chains with respect to instruction decoding/parsing, as opposed to infinite), and even more important for reverse engineering software, debuggers, etc. Even better: Declare that only 16-byte aligned addresses are valid jump targets.

Effect: You can decode a huge binary, and be sure that all valid jump targets have been found. It is easy to check whether some instruction is contained in the code or not: use grep. Your compiler should be able to ensure alignment without issuing too many NOPs.

Enforce "write XOR execute". This has a lot of advantages. Firstly, security-- but this should rather be the application programmer's job.

Secondly, simplified cache coherency: instructions and data use separate caches, and attempting to ensure that a write to an instruction already in the pipeline does correctly roll back... yeah. On ARM you need to explicitly flush the instruction cache for this to work.

Third, and the biggest advantage: Consider QEMU. Emulating ARM is a breeze: you KNOW where code is located, and every jump/instruction cache flush triggers software decoding, translation to e.g. x86 and optimization (LLVM), caching of the result, etc, and then continuation.

Do not try this with x86. You don't know where instructions start, you don't know whether the code is self-modifying, everything sucks.

Btw: self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

Indeed, I imagine that your ISA could first find adoption as an intermediate representation (like llvm IR)-- if you ensure that it is very fast to jit/compile from your language to fast x86 and arm, and pretty fast to verify properties of code.

Bonus request: Performance-hint/NOP: Allow the instruction to hint at likely-ness, often the compiler/programmer can predict the branch pretty well.

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there. Or maybe provide a separate magic memory location for this, if the stack would clash too much with calling conventions.

Desired effect: The compiler/programmer is limited by the number of addressable registers. Often, you will need to temporarily store stuff you need later. In reality, more registers exist. Allowing the processor to avoid hitting the L1-cache and using a renamed/anonymous register should give huge performance boosts in future processors with more anonymous registers running legacy code.

At the same time, this makes dependency analysis (in software and hardware) easier.

Bonus request: Hardware support for constant-time operations. It turns out that this is quite important in a lot of crypto protocols (timing side-channels). I am not entirely sure how to best help here, but if it was possible to get an accurate cycle counter, and a NOP-until-cycle-xyz instruction, this would probably be supremely useful. Both instruction (count-cycle and wait-until-cycle) are allowed to take very long: these are rare instructions, and it is totally fine to take 10000 a cycle hit! On the other hand, these would also simplify a lot of debugging and performance testing.

Re floating point exceptions: I think control registers, trapping and fault-handling are the right way to go for "extremely unlikely" branches. This way, you basically have a conditional branch on every memory access or floating point operation (page fault?), which has a very high cost for mispredictions, and is very cheap on correct predictions (effectively zero overhead).

Author: Agner Date: 2017-09-20 15:23
Thanks for your thoughtful suggestions. This project may become a sandbox for many experiments and research on such ideas for more efficient computer systems.

yeengief wrote:

Re variable instruction lengths: Consider some forced alignment, e.g. every 16-byte (or every 32-byte) boundary MUST begin a new instruction. This has minimal impact on the effective code density (if your compiler is not stupid), while allowing almost random access decoding of instructions. This is good for the processor (short dependency chains with respect to instruction decoding/parsing, as opposed to infinite), and even more important for reverse engineering software, debuggers,

Currently, the ForwardCom ISA allows three different instruction lengths: 1, 2, and 3 words of 32 bits each. Decoding is easy because the instruction length is determined by only two bits. Forced alignment is applied only to tiny instructions of half-word size, which must be packed two-by two at word boundaries, and you can't have a jump target in between. The problems you are trying to solve certainly exist in the x86 world where it is very complicated to determine the length of an instruction, and some compilers put data in the code segment (i.e. jump tables). But I don't see any big problems in ForwardCom. I found that variable instruction length is important because it is difficult for the compiler to predict whether the distance for a relative jump will fit into a certain instruction size. It is easier to leave it to the assembler to find the optimal size of each instruction.

Regarding reverse engineering, I have allready made a disassembler. It works so well that the code can be re-assembled. I am soon finished making a high-level assembler. The syntax resembles C, but the programmer has to select the individual instructions and registers. I will put it on github later this autumn.

Enforce "write XOR execute".

Of course. Executable code is in execute-only sections by default. Execute-and-read access is possible, but write acces should not be allowed. Jump tables and other code pointers are in read-only sections. Self-modifying code is not allowed.

Bonus request: Performance-hint/NOP: Allow the instruction to hint at likely-ness, often the compiler/programmer can predict the branch pretty well.

Branch prediction matters only for branches that are executed millions of times. Static branch prediction affects only the first few executions of the millions, while dynamic branch prediction affects them all. If I understand you right, the likely-ness hint is a form of static branch prediction, so I don't see the point. The compiler may place the often-executed branches or functions in a hot section and the rarely executed branches, such as error handling, in a cold section that rarely pollutes the cache. I see this as a pure compiler issue which does not affect the ISA design.

The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

Each thread has private stack memory by default which is not accessible to other threads, except when legacy software requires otherwise. I am not sure I understand you, but maybe this is what you mean?

Hardware support for constant-time operations. It turns out that this is quite important in a lot of crypto protocols

Funny that you should mention this. I got the same idea a week ago. The next version will have optional support for a mask bit that dictates that execution time should be independent on the values of operands. This does not necessarily imply that execution time is predictable, only that it is independent of the values of sensitive data. Branches can be replaced by predicated instructions, and small lookup tables can be replaced by vector extract or vector permutation instructions.

Re floating point exceptions: I think control registers, trapping and fault-handling are the right way to go for "extremely unlikely" branches. This way, you basically have a conditional branch on every memory access or floating point operation (page fault?), which has a very high cost for mispredictions, and is very cheap on correct predictions (effectively zero overhead).

Fault trapping can be turned on or off for both floating point errors and integer overflow. Only problem is that the behavior depends on the vector length if an error occurs in a single vector element. NAN propagation is offered as an alternative if you want to trace errors to a single vector element and you want identical behavior on processors with different vector length.

Author: yeengief Date: 2017-09-20 16:22
>Currently, the ForwardCom ISA allows three different instruction lengths: 1, 2, and 3 words of 32 bits each. Decoding is easy because the instruction length is determined by only two bits.

The first problem with variable instruction lengths is that code has, in worst case, three different interpretations, depending on the entrypoint. I agree that this price is likely worth paying for the increased code density.

My problem is the long-range dependency of the correct decoding, depending on the entrypoint. Without restrictions, this dependency is effectively infinite-length. Q: what has x86 code and DNA in common? A: You need a hidden markov model to guess at the decoding if you don't know the entrypoints. This is bad, I don't want to care about BLAS when writing a disassembler.

To take a larger number: Suppose you say that every aligned 16 word block (nice, cache-line!) MUST begin a new instruction. Now, the length of the dependency of the "correct reading frame" for your code has shrunk down to one cache line.

What does this cost us? The very worst case is that we need to pad two NOPs when trying to emit a 3-word instruction near the end of a line. In other words, the code becomes 2/16 = 12.5% less dense, if the compiler can do no re-orderings at all. Code where every instruction is dependent on the previous one will be dog-slow anyway, so meh.

My maybe too strict initial proposal of cutting at 4 words, would give a maximal overhead of 50%, i.e. double the code length-- for code where no reordering is possible, and the lengths are maximally bad. Since instructions of length one are most frequent, and most of the time you have a little bit of freedom to reorder, this should not occur too often.

Restricting jump targets to certain alignments should likewise have only limited effect on code-length, except for the case where we have tiny tiny loops. Unfortunately, only allowing jumps to beginnings of cache-lines is probably too restrictive, otherwise we could get unambiguous decoding. This would simplify a lot of stuff, e.g. because the processor designer could decide to dynamically translate code into some internal uops, unambiguously and cached and independent of actual execution flow. In other words: We would need to decode instructions only on mapping pages as executable.

Also, it would simplify verification of code properties. For example, NaCL (google native client) uses such alignment restrictions to make the decoding of NaCL-compliant code statically unambiguous.

Execute-and-read access is possible, but write access should not be allowed.

Now that you mention it, why not go full Harvard? Would it be really bad to disallow read access to executable pages? Again, the idea is dynamic translation: the original code you want to read might have been swapped out to tape years ago, and if you really need access, just map it into a different page.

Sidenote: Apple's iOS is almost Harvard. Mapping a page to executable triggers code signature checks and loads of crypto, and removes writable. I think they still allow read access, though.

Author: Agner Date: 2017-09-20 23:52
yeengief wrote:

the processor designer could decide to dynamically translate code into some internal uops, unambiguously and cached and independent of actual execution flow.

Many x86 processors now have a micro-op cache because the decoding of instructions is a bottleneck. The ForwardCom ISA is designed so that decoding is very simple, never requiring more than one pipeline stage or one clock cycle. My idea is that a micro-op cache should never be needed. It should be easy to decode several consecutive instructions in parallel.

Regarding Harvard architecture, I imagine that there will be a code cache and a data cache, and perhaps a third cache for read-only data sections. But the hardware designer is of course free to choose some other configuration. The call stack and data stack are separate. The call stack will be so small that it can fit into its own little cache throughout the execution of a program, except for deeply recursive functions.

Author: yeengief Date: 2017-09-21 07:54

Many x86 processors now have a micro-op cache because the decoding of instructions is a bottleneck. The ForwardCom ISA is designed so that decoding is very simple, never requiring more than one pipeline stage or one clock cycle.

Yes, but decoding is still sequential and not parallel. My proposal makes decoding ridiculously parallel.

New number proposal: 8 words, a half cache-line. The beginning of every 8 word block must begin a new instruction. Only the beginning of 8-word blocks are valid jump targets. Hence, I can never jump into the middle of an instruction. (Alternative proposal: Decoding of instruction lengths is always relative to the most recent 8-word boundary. Entry into the middle of an instruction is a fault. Worst case price: The decoder must look back 7 words on a jump and decode the instruction lengths in order to determine whether the jump was faulty. This does not require a lot of extra instruction fetching because these 7 previous words are in the same cache-line anyway).

Worst case overhead for instruction alignment restriction: 25% is NOP. Worst case for jump target restriction: Harder to compute, but compilers should be able to manage without too many NOPs.

Gain: Decoding is ridiculously parallel, because I can decode every 8-word block independently. If decoding becomes the bottleneck, I just plug in more decoding units (i.e. we might still be limited by decoding latency, but never by decoding throughput). The longest possible dependency chain in decoding is 7. Your proposal has infinite dependency chains in decoding (you must know the reading frame).

I can statically analyze code coming in 8 word-blocks and be sure to never have mis-decoded an instruction.

Another bag of problems we can avoid: Suppose some program uses code in two valid decodings, dependent on entrypoint. This means that your branch predictor, uop-cache, etc must work on the tuple (memory location, reading frame), where reading frame is one of 0,1,2. With my proposal you cut this down to just memory location, which simplifies the logic a lot-- because the logic for caching uops or other processor generated metadata is the same logic as for caching data and just maps address -> stuff.

The only real-life code that does such stuff (e.g. uses a 2-word instruction as a different instruction by jumping into the middle) is shellcode. While shellcode is fun and all, this is maybe not the most important customer group. You target "real" compilers, and not return oriented programming compilers.

Author: Agner Date: 2017-09-21 12:54
yeengief wrote:

Yes, but decoding is still sequential and not parallel.

My plan is indeed that decoding in parrallel should be possible. Assume that you are loading 32 bytes (= 8 words) of code in one clock cycle. Each 4-byte word has two bits that may determine instruction length. This gives 16 bits in all. You need a combinational logic circuit with 16 inputs to determine where all the instruction boundaries are. This can easily be done in one clock cycle. There is no reason to decode too far ahead because mispredicted branches may spoil the advantage. I think it is better to spend your resources on decoding multiple possible branches simultaneously when facing a branch with poor prediction. And, as I said, a micro-op cache is not needed.

Author: yeengief Date: 2017-09-21 14:47
>My plan is indeed that decoding in parallel should be possible. Assume that you are loading 32 bytes (= 8 words) of code in one clock cycle. Each 4-byte word has two bits that may determine instruction length. This gives 16 bits in all. You need a combinational logic circuit with 16 inputs to determine where all the instruction boundaries are. This can easily be done in one clock cycle. There is no reason to decode too far ahead because mispredicted branches may spoil the advantage. I think it is better to spend your resources on decoding multiple possible branches simultaneously when facing a branch with poor prediction. And, as I said, a micro-op cache is not needed.

If you really, really don't want this,.... But I am telling you that you will regret this choice the next time you need to software-decode gigabytes of data (e.g. malware sample database, memory snapshots for forensics) or when you want to write a QEMU module for emulating your processor, or when you want to implement a sandbox for untrusted code (see NaCL, web-assembly, etc), where you want a whitelist of acceptable instructions (in order to protect from unknown kernel/driver bugs). And every hacker will tell you that the ambiguity of instruction decodings is absolutely bonkers (hey, I overflowed over a function pointer but have no w&x page.. let's jump into the middle of some benign instruction to give an entirely different meaning to the rest of the code).

In fact: How would you software decode a big binary blob and check whether it contains a forbidden instruction?

Sequentially, and letting all but one core idle?

Speculatively and parallel, throwing away 2/3 of the work?

Ah, this does not really work: You cannot throw away 2/3 of the misaligned work because you wanted to reach all possible decodings that the processor can see. Hence, your data has grown by a factor of three during "complete" decoding, and you will need to use complex and faulty heuristics to remove unneeded stuff (oh, if I jump here then I will hit a invalid instruction down the road, by symbolic execution. Hence, this probably was not the right reading frame.... Hah! it was an undocumented instruction and you just missed the crucial part of the code!).

This is not hypothetical. This hell is reality, today, behold IDA. Please don't make more of it.

And, as I said, a micro-op cache is not needed.

It is needed if I want to run your binary in QEMU on a different architecture. The very first users of a new architecture are people who want to code/debug on an emulated version, long before the first chip is produced.

And who knows whether a hardware micro-op cache will be needed in the future? By allowing jumps into the middle of instructions you will force everyone in the future to support such legacy behaviour which is effectively good for shellcode only.

Author: Agner Date: 2017-09-23 00:43
Enforcing instruction alignment at cache boundaries doesn't help much, because malicious code can jump over the boundaries. You would have to enforce alignment of all jump targets as well, which means a lot of wasteful NOPs in the code.

Another possible solution is to use the same principle as in UTF-8 code: single-byte UTF-8 codes and continuation bytes can be distinguished by the first bit. A similar principle could be implemented in ForwardCom. The first word of a multi-word instruction has the most significant bit = 1. The last word of a multi-word instruction has the most significant bit = 0. A single word instruction has the most significant bit = 0. This will make it impossible to execute stealth code by misaligning an entry. To implement this feature we need an extra bit in the first word of each multi-word instruction to replace the bit we have used in the last word. If the last word contains a 32-bit payload such as a 32-bit address or a floating point number, then all binary tools will need to have the extra complication of inserting the missing bit. This is a clumsy solution, but possible. The first word of multi-word instructions has no unused bits so we will have to sacrifice some other feature if we want to make space for this extra bit.

We have to make a cost benefit analysis to see if this feature is worth the costs. The costs of making a bit to distinguish start words from end words are: fewer bits for instruction features, clumsy and complicated binary tools, and slower emulation. The advantage is easier analysis of potentially malicious code.

I wonder it the kind of analysis you are talking about actually is an efficient weapon against malicious code. It is not easy to jump to an unauthorized address on ForwardCom without detection. All jump tables and function pointers are in read-only memory, and the call stack is separate from the data stack and isolated from everything else. The only way a malicious code could jump into an unrecognized address is by jumping to a calculated address and make this calculation difficult to trace. A scanner for malicious code should look for any jump to a calculated address and do a more thorough analysis only if such a jump occurs. This extra analysis wouldn't be so time consuming after all because disassembling is fast.

I would be more concerned about malicious code calling a system function with a calculated ID. Running in a sandbox, the code would call a benign system function, while under certain conditions or on a certain date, the malicious code would call a different system function with different parameters. An intelligent analysis tool could detect that the code uses a calculated system function ID and calculated parameters, but it may be impossible for even the best analysis tool to detect what the code can do if the calculation of system function ID uses cryptographic techniques. We cannot prevent this kind of attack by code alignment techniques and analysis of binary code. I think the best prevention against threats like these is to make restrictions on the access rights of each executable file. I have proposed the following security principle for ForwardCom: An executable file is not allowed to run unless it has been installed by a standardized installation procedure. This procedure includes the assignment of access right to the executable file, such as whether the file is allowed to access various resources.

Author: - Date: 2017-09-22 22:05

Btw: self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

I'm not sure about strictly enforcing W^X. I get that it really messes with processor optimizations and all, but applications such as emulators which make heavy use of dynamic recompilation would suffer if they are forced to syscall every time they modify code. https://news.ycombinator.com/item?id=11789169
I think an instruction to explicitly flush the instruction cache could be a good compromise.

I haven't looked at this architecture yet, though I do wonder how cache flushing instructions could mess with security issues like cacheline based attacks.

Author: Agner Date: 2017-09-23 01:19
- wrote:

self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

I am not too fond of the trend of JIt compilation and byte code. If you are JIT compiling each part of a program separately, you get erratic and unpredictable response times which users hate. I would prefer to compile everything once. ForwardCom is designed to be forward compatible in the sense that a compiled program can run optimally without recompilation on future microprocessors with longer vector registers. This means less need for JIT compiling. Another planned feature is re-linking of executable files. Instead of calling a DLL or shared object, you use static linking to include the necessary library functions in the executable, with the possibility to re-link later if a library function needs replacement. This makes the code run faster than if dynamic linking is used, and it makes it possible to adapt the code to different environments by replacing e.g. a user interface library.

I'm not sure about strictly enforcing W^X. I get that it really messes with processor optimizations and all, but applications such as emulators which make heavy use of dynamic recompilation would suffer if they are forced to syscall every time they modify code

JIT compilation is also a security problem. If we allow self-modifying code or code that makes code and executes it immediately, then we have all kind of security problems. I would like to re-think the whole idea of JIT compilation.

Emulation is a problem, I agree, if we don't allow dynamic compilation. An emulator running under ForwardCom and emulating something else would need special permission to do dynamic compilation and running the code it generates. The user will be warned when installing the emulator that it needs dangerous special permissions.

Author: Hubert Lamontagne Date: 2017-09-25 16:43
yeengief wrote:

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there. Or maybe provide a separate magic memory location for this, if the stack would clash too much with calling conventions.

Desired effect: The compiler/programmer is limited by the number of addressable registers. Often, you will need to temporarily store stuff you need later. In reality, more registers exist. Allowing the processor to avoid hitting the L1-cache and using a renamed/anonymous register should give huge performance boosts in future processors with more anonymous registers running legacy code.

At the same time, this makes dependency analysis (in software and hardware) easier.

That's an interesting proposal, though I think you'd probably end up needing 2 different stacks. The C++ stack is a mosaic of data that the compiler knows that the user "can't" access (like local values of previous functions that are never accessed through a pointer, register saving, return addresses), and data that the compiler knows is "pointer accessible" (values that the code gets the address of). "Pointer accessible" data needs to be readable across threads, too (it's totally legal to have another thread operate on data on your stack, as long as you can guarantee that your function won't exit before the other thread is done). "Pointer accessible" data can't be renamed to a register - it basically has to come out of the same accesses as other memory read/writes and must preserve order and page-faulting behavior (which basically means it has to be in L1 data cache). But the "non-pointer accessible" stack could probably have its own separate memory space, which would make it possible to give it its own separate cache and read/write to it in parallel, and even do register-renaming on it as you propose.

Author: Agner Date: 2017-09-26 00:05
Hubert Lamontagne wrote:

yeengief wrote:
Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

I am not sure how this differs from my proposal: Each thread has two stacks. The call stack is private and separate from everything else (and usually so small that it can be cached entirely on-chip). The data stack is private to the thread that owns it. This requires some discipline from the programmer. Only static memory or memory allocated by the main thread is accessible to all threads. Or you may have a special malloc_shared function for allocating memory that is accessible to all threads. Semafors and mutexes are placed in shared memory.

The private data stack is used for function-local variables and register spilling. Since it is private, it does not need coherence across threads. This makes caching of local data simple and efficient. It is also good for safety, of course.

Legacy software needs to turn off the private stack feature if it is sharing local memory between threads. A flag in the executable file header could control this feature.

Author: Hubert Lamontagne Date: 2017-09-26 01:37
Agner wrote:

Hubert Lamontagne wrote:

yeengief wrote:

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

I am not sure how this differs from my proposal: Each thread has two stacks. The call stack is private and separate from everything else (and usually so small that it can be cached entirely on-chip). The data stack is private to the thread that owns it. This requires some discipline from the programmer. Only static memory or memory allocated by the main thread is accessible to all threads. Or you may have a special malloc_shared function for allocating memory that is accessible to all threads. Semafors and mutexes are placed in shared memory.

The private data stack is used for function-local variables and register spilling. Since it is private, it does not need coherence across threads. This makes caching of local data simple and efficient. It is also good for safety, of course.

Legacy software needs to turn off the private stack feature if it is sharing local memory between threads. A flag in the executable file header could control this feature.

The difference is that I'm seeing it from the C++ compiler point of view, and how the current ecosystem of malloc+mutexes+threads works in complex applications. As it is, program values that you don't take the address of are indeed truly local, and will show up as renamable values in the SSA pass and not translate into GEP+Load+Store instructions, and can safely be moved into a non-addressable memory like a special stack (or, obviously, a register). This class of values includes register spilling, argument passing, and function return addresses - none of these are "really" visible to the programmer (except for the value), so they can also be moved to a private stack - even in existing C++ code! This doesn't even require special discipline. You can do really crazy optimizations on this "value-only" data, like reordering the reads/writes, assuming perfect alignment and no partial value writes, making page faults and reads/writes from other threads impossible, allowing register renaming and so forth, because you're protected from all the crazy cases since the user doesn't have the address and can't pass it around halfway across the program.

As soon as you take a pointer or reference, it automatically becomes visible to anything that you can pass that pointer to, which turns out to be "the whole rest of the process including other threads". This forces you to give it a real memory address and do real Loads/Stores that run in program order and are visible across all threads. And it's only after that, that you can put in your optimizer pass that tries to remove the Loads/Stores if it can prove that it doesn't change anything in program behavior - which turns out, is really hard. Even just reordering Loads/Stores is hard, because it's hard to prove that function pointers will never overlap and rule out even really dumb cases like input and output buffers to a function overlapping out of alignment.

The difference is that my proposal is based on pointer sharability:
- The main data stack for each thread is NOT thread local and is coherent across the whole process, other threads can read/write in its memory range. It is used for pointer-accessible local values.
- The call stack is local and can also contain anything the compiler KNOWS is non-pointer-accessible such as function call register saves/restores and spillovers and local arrays, and is technically isn't even in paged memory, and isn't accessed with normal load/store instructions.
- Memory allocated using malloc() is accessible from all threads (even if it's allocated by a worker thread). This is necessary for tons of existing C++ code. You can provide a private_malloc() function.

Post by **agner** » 2017-11-02, 14:30:24

Author: Agner Date: 2017-04-27 00:03
I am working on the binary utilities for ForwardCom, and I have got an idea of a new concept for function libraries.

The well-known method for static linking is used. An option to the linker specifies that certain object files or library files should be re-linkable. The necessary symbol and relocation records are then preserved in the executable file so that the library functions can later be replaced by repeating the link process.

This concept has many advantages:

The executable file is ready to run without any external dependencies. You will never have problems with a missing DLL or a wrong version of a shared object.
An important advantage of static linking is that you don't have to load a large dynamic library containing hundreds of functions when you only need a few of these functions. The re-linking feature combines this advantage with the ability to update or replace a library later or to use a third-party library, which traditionally requires dynamic linking.
If you update a DLL in Windows, you are affecting all programs that use this DLL, even if some programs work better with the old version. With re-linkable libraries, you can update a library in each program separately.
You can update or modify a program without access to the original source code.
The rarely used symbol interposition feature in shared objects is very inefficient because all accesses to public symbols go through global offset tables (GOT) and procedure linkage tables (PLT). These inefficient table lookups are not needed with the relink method
You can have different versions of your functions for different platforms. For example, the same program can have different graphical user interfaces for different environments. The appropriate version can conveniently be linked to the executable program as part of the installation procedure.
You can have specific versions of a critical piece of code optimized for different microprocessor models.
You can reconfigure a program at any time if you update the hardware or operating system.
You are not wasting time on loading and linking dynamic libraries every time a program is loaded.
Modifying or updating an executable program requires administrator privileges, for security reasons.
While this concept is intended for ForwardCom, it can also be implemented in existing systems. The only thing this requires is a new linker. Run-time loading of this kind of library is only possible in ForwardCom, though.

I have previously proposed a feature to link in libraries when a program is loaded, where the choice of library can depend on environment variables, or whatever. Do we still need this feature if we have relinking at install time? I think it is safer to relink when a program is installed than when it is loaded, because a hacker might insert a tampered-with library to be linked at load time.
If relinking requires administrator privileges then it is more convenient to do it when the program is installed. In the rare cases where a choice between different versions needs to be done at load time, it would be safer (and faster) to store different ready-made versions of the executable file.

Any comments?

Author: Hubert Lamontagne Date: 2017-04-27 17:18
Interesting. You'd still probably need some kind of dynamic linking for plugins (since a same set of interface functions can lead to multiple different plugins loaded at the same time... maybe you could do it with some kind of OS-supplied dispatcher), but it would probably cut the usage of dynamic linking a lot, yes.

Author: Agner Date: 2017-04-27 23:13
Hubert Lamontagne wrote:

You'd still probably need some kind of dynamic linking for plugins.

You are right. A plugin could in principle be statically linked into the executable, but you would probably prefer third party plugins to be loaded at runtime.

Maybe we need to mark functions and data in the executable as public or private, where only public symbols can be accessed by plugins. But a plugin, whether statically or dynamically linked, is still able to access everything that it can find the address of. This might be a security problem. Even if there is no malicious intent, a poorly behaved plugin that messes with things it shouldn't touch might cause compatibility problems.

This is also a problem in existing systems, of course, but I would like ForwardCom to be safer. Maybe we need a system where each plugin is running as a separate thread or process where we can control its access rights?

Author: Hubert Lamontagne Date: 2017-05-10 16:43
Agner wrote:

Maybe we need a system where each plugin is running as a separate thread or process where we can control its access rights?

I think that class of problem tend to be already solvable using IPC (inter process communication)... And it wouldn't address some of the use cases I've seen:

- Java-C++ interfacing is done with DLLs on windows, and presumably programs do tons of small function calls back and forth, and do stuff like sharing large arrays in memory. The first generation of Android apps have a ton of this stuff going on (to interface between C++ code and Android's Java APIs), and getting the thread-to-thread overhead every time you read a button or something like that would've been a problem.
- Some script languages also use dynamic linking to interface with C++ (like, I think Python does this), and they will do tons and tons of small function calls.
- Audio processing plugins need to run in the same thread as the caller because they use tiny buffers (like, 1ms tiny - which means that the whole processing of a block of audio must take less than 0.001s before sending to the audio hardware every single time). If the OS's scheduler gets involved in any way you'll have dropouts for sure.
- Some game engines allow for plugins to extend features, which might require sharing access to very large data structures like level data in memory.

In a way, this reminds me of the whole micro-kernel thing, which had the same issue but between the applications and system modules (applications don't want to wait after the scheduler to call system module functions!).

Author: Agner Date: 2017-05-11 01:07
Good examples, Hubert.

I think the optimal solution for Java, C#, and script languages is just-in-time compiling. Actually, I don't understand why there are so many slow applications out there that don't use JIT compiling

The code only has to be compiled the first time it is run. Or - why not distribute it in compiled form for ForwardCom? The compiled code is guaranteed to be forward compatible with future ForwardCom processors.

The compiled code can easily be linked to C/C++ code in the JIT process. The calling conventions of ForwardCom are standardized to work across different programming languages.

You are probably right that interprocess communication is a good solution for plug ins if you don't want the plug in to have access to mess with everything in the mother program.

Author: Hubert Lamontagne Date: 2017-05-12 17:34
Agner wrote:

Good examples, Hubert.

I think the optimal solution for Java, C#, and script languages is just-in-time compiling. Actually, I don't understand why there are so many slow applications out there that don't use JIT compiling

The code only has to be compiled the first time it is run. Or - why not distribute it in compiled form for ForwardCom? The compiled code is guaranteed to be forward compatible with future ForwardCom processors.

The compiled code can easily be linked to C/C++ code in the JIT process. The calling conventions of ForwardCom are standardized to work across different programming languages.

You are probably right that interprocess communication is a good solution for plug ins if you don't want the plug in to have access to mess with everything in the mother program.

To load in C++ objects, the Java code calls System.loadLibrary("name"), which on windows loads "name.dll" and on posix loads "libname.so", so you still need either dynamic linking or arbitrary on-the-fly recompilation phases on already running apps, because your Java code is already running when you learn that you have to load this library. Another similar mechanism in Java is ClassLoader, which dynamically loads in more Java code into an already running app from arbitrary jar files on the disk (it's basically the Java equivalent of loading DLLs).

Author: Agner Date: 2017-05-13 00:58
Hubert Lamontagne wrote:

To load in C++ objects, the Java code calls System.loadLibrary("name")

ForwardCom also supports runtime dynamic linking. A highly optimizing JIT compiler might replace this by static linking.

forwardcom forum

The original proposal, 2 years ago

Introduction website

Proposal for instruction set - now on Github

ARM with scalable vector extensions

Proposal for instruction set - now on Github

Paging

A null register?

Indexed registers

Bilinear Interpolation

ForwardCom version 1.04

Async system calls; horizontal packing instruction

Comparison of instruction sets

ForwardCom version 1.05

Syscall/ISR acceleration

Jump prefetch?

High precision arithmetic

Intel's Control-flow Enforcement Technology

Assembler with metaprogramming features

Number of register file ports in implementations

Proposal for instruction set - now on Github

New idea: Re-linkable libraries in ForwardCom