The original proposal, 2 years ago

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Whole-function vectorization and conditionals

Post by agner » Thu Nov 02, 2017 12:06 pm

Author: Sylvain Collange Date: 2016-08-15 04:00
Agner wrote:
My reason for using the same registers for floating point scalars and floating point vectors is that floating point code often contains calls to mathematical function libraries. Such a library can use variable-length vectors for parameters and results. The same library function can be called with a scalar or a vector because a scalar is simply handled as a vector with one element. This will make it easy for the compiler to vectorize the code. ForwardCom has a separate register set for integer scalars. These are useful for pointers, loop counters, array indexes, etc. that don't need vectors, and they are easier to save and restore than vector registers.
This looks quite elegant. The vector length is passed implicitly to the caller without any change to the ABI.

How do you handle conditionals, though? For instance:

Code: Select all

for (int i = 0; i < N; i++) {
    if(a(i) != 0)
        b(i) = sin(a(i));
}
Does the ABI define a register as the "current mask" (GPU-like)? Or should the caller compact input vectors to have only contiguous elements, and expand the output vectors (vector processor-like)?


Author: Agner Date: 2016-08-15 06:17
Sylvain Collange wrote:
Does the ABI define a register as the "current mask" (GPU-like)? Or should the caller compact input vectors to have only contiguous elements, and expand the output vectors (vector processor-like)?
Vector registers r1-r7 can be used as masks. The normal solution for your example would be to calculate the sin function on the whole a vector and then merge the result, using a mask for (a(i) != 0).
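
As a scalar reference (a plain C++ sketch of the per-element semantics, not ForwardCom code; the function name is made up), that pattern would look like:

Code: Select all

#include <cmath>

// sin() is computed on every element; the mask only decides which
// results are merged into b.
void sin_where_nonzero(const double* a, double* b, int n) {
    for (int i = 0; i < n; i++) {
        double s = std::sin(a[i]);   // whole vector: computed unconditionally
        if (a[i] != 0.0)             // mask = (a(i) != 0)
            b[i] = s;                // merge: only enabled elements update b
    }
}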

If you want to mask off some elements throughout the calculation of the sin function then the library function would need an extra parameter for the mask (some Intel math libraries actually have this). You wouldn't save any time by masking off elements during the calculation. The only reason I can see to have a mask on library functions is to avoid interrupts (traps) if floating point exceptions are enabled. The sin function wouldn't generate exceptions, but the logarithm function would in your case. The recommendation is to detect floating point errors as NANs and INFs rather than using exceptions, because exceptions depend on the vector length. If you replace sin by log in your example then the log function would generate INF or NAN for non-positive values, and these values would be removed by the mask operation after the function call. If you insist on having floating point exceptions turned on, for whatever reason, then you might want a version of the log function with a mask, or you could change the non-positive values before the call to the log function.
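
A scalar sketch of that last alternative (illustrative C++ only, with made-up names):

Code: Select all

#include <cmath>

// Non-positive elements are replaced by a harmless value before the call,
// so log() cannot trap even with exceptions enabled; the mask then discards
// the dummy results.
void log_where_positive(const double* a, double* b, int n) {
    for (int i = 0; i < n; i++) {
        double safe = (a[i] > 0.0) ? a[i] : 1.0;  // substitute before the call
        double r = std::log(safe);                // can no longer fault
        if (a[i] > 0.0)                           // mask off the dummy results
            b[i] = r;
    }
}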

Compacting the vector, as you propose, might be useful if the vector is very sparse (only a few of the elements are used) and you could compact several vectors into one before calling the function. However, compacting vector elements takes time, so this would be useful only for very time-consuming functions.


Author: Sylvain Collange Date: 2016-08-15 07:35
Thanks. I was considering the more general case of vectorizing arbitrary functions, including functions with side effects. There is no reason to restrict function-level vectorization to a few hand-written math functions, is there?
As the compiler does not know in general whether a particular function has side effects or can raise exceptions, my point is that the mask could be part of the ABI, just as vector length (implicitly) is. This would enable vectorization-agnostic composition of function calls.

By the way, have you considered the micro-architecture and compiler implications of using a subset of vector registers as masks? It seems to introduce quite a lot of complexity:

- Instructions take up to 5 input operands from vector registers (RS, RT, RU, Mask, RD). The old register value RD is required to merge the result.
Supporting 5-input instructions in a wide out-of-order superscalar is out of the question with the current state of technology: executing 3 masked vector instructions/clock would require 15 read ports in the SIMD register file, plus the ports needed for the load/store unit. Worse, making the operation dependent on the former value of the architectural destination register defeats the purpose of register renaming: write-after-read dependencies are still there.
A more acceptable implementation would break down each masked instruction into 2 µops: first a computation on the whole vector, then a merge (I believe that is how Knights Landing and Skylake handle masked AVX-512 instructions). This has a serious performance impact, and still does not completely eliminate write-after-read dependencies.

- The dependency with the old destination value and the merging operation could be avoided when unused lanes are set to 0 instead of the old value. Unfortunately, that information is encoded in the contents of the mask register itself, so the hardware does not know it early enough.

- More generally, masks encode control flow information, while vector registers encode data flow information. They are not needed at the same stages in the pipeline. Execution logic does not need mask (except as an optimization for saving power), and control logic does not need data. Encoding control flow information and data flow information in the same registers makes decoupling these stages much harder.

- From the compiler perspective, having some instructions that can only access a subset of registers makes register allocation much harder.


Author: Agner Date: 2016-08-15 10:53
Sylvain Collange wrote:
my point is that the mask could be part of the ABI, just as vector length (implicitly) is. This would enable vectorization-agnostic composition of function calls.
A function may have an unlimited number of input parameters and an unlimited number of outputs (through pointers or references or class members or whatever), all of which could in principle have different masks. That would add a lot of complexity to the ABI for something that is rarely needed. I prefer to specify a mask as an explicit function parameter in the rare cases where it is needed.
By the way, have you considered the micro-architecture and compiler implications of using a subset of vector registers as masks?
A lot of instructions in current instruction sets are using certain registers for specific purposes. I don't think this is a big problem. It is a much bigger problem to have a separate set of registers for masks only, as AVX512 has, in my opinion.
Instructions take up 5 input operands from vector registers (RS, RT, RU, Mask, RD). The old register value RD is required to merge the result.
No, the first source operand is required to merge the result when masking. This happens to be the same as RD only if using an instruction format with too few register operands. This gives a maximum of 5 input dependencies. These are not required at the same stage in the pipeline. The execution stage has typically 2 or 3 input operands. A few unimportant optional instructions have 4 inputs. I added these instructions mainly for testing whether it is feasible to have four input operands.
The address calculation stage in the pipeline can have up to three input dependencies: base pointer, index and vector length (or block size). The architecture does not specify which stage in the pipeline reads the mask register. It could be somewhere between the address calculation stage and the execution stage if the number of register read ports is critical.

So, the total maximum number of input dependencies is five, but the maximum number of input dependencies at one stage in the pipeline can be limited to three if necessary. This is still a lot of complexity to handle for the scheduler, I agree. I have limited the number of output dependencies to one for the same reason. This limitation creates problems for add-with-carry and for integer overflow detection, but people have convinced me that it was more important to limit the number of output dependencies. x86 has two output dependencies on most instructions, including the flags and the floating point control/status registers.

Intel processors have traditionally had a limitation of two input dependencies for each micro-operation, but this has been increased to three in recent versions. AMD have never had any limitation on input dependencies. I don't know how expensive it is to have many read ports, but I prefer to have many read ports rather than splitting instructions into micro-operations. My idea is that a ForwardCom processor should not need to split instructions into micro-ops.
making the operation dependent on the former value of the architectural destination register defeats the purpose of register renaming: write-after-read dependencies are still there.
All superscalar processors handle this by renaming the register so that the input and output use two different physical registers.
A more acceptable implementation would break down each masked instruction into 2 µops: first a computation on the whole vector, then a merge (I believe that is how Knights Landing and Skylake handle masked AVX-512 instructions).
I have not been able to get my hands on any AVX-512 processor yet, so I am not able to test it. If you have any information on the microarchitecture of these processors then please give me a link.
More generally, masks encode control flow information, while vector registers encode data flow information.
The masking mechanism translates control flow into data flow. It does not change the dependency of the other input and output operands, as I explained above. If all elements of a vector are masked off then the hardware will still treat the input operand as a dependency and wait for it. You may want to convert this into a branch if the prediction rate is good.


Author: Sylvain Collange Date: 2016-08-15 14:05
Agner wrote:
A function may have an unlimited number of input parameters and an unlimited number of outputs (through pointers or references or class members or whatever, all of which could in principle have different masks. That would add a lot of complexity to the ABI for something that is rarely needed. I prefer to specify a mask as an explicit function parameter in the rare cases that it is needed.
I do not see in which situation a vectorized program would need to pass multiple masks to a function. If we vectorize a loop with a function call inside, for each iteration i, either the function is called (bit mask 1) or not called (bit mask 0). Masks originate from the control flow of the original program. There is only one control flow regardless of the number of function parameters. So a single mask register is sufficient.
So, the total maximum number of input dependencies is five, but the maximum number of input dependencies at one stage in the pipeline can be limited to three if necessary.
The pipeline stage does not make a difference. If you want to sustain an execution rate of three 5-input instructions per clock cycle, you need the bandwidth to read 15 inputs per cycle on average from the register file. It does not matter in which order and at which pipeline stage they are read.

I agree that limiting the number of outputs is even more important. (Register file area scales with the square of the number of ports and power scales with its cube. But for read ports, you can always follow the brute-force approach of duplicating the register file and get a linear scaling.)
Yes, x86 instructions also output flags, but x86 CPUs use tricks to turn them into single-output instructions by having slightly wider physical registers, storing the flags in the physical destination register, and renaming the architectural flags register to the physical destination of the last instruction issued.

But input ports are also a bottleneck. Specializing registers allows this complexity to be partitioned. Most ISAs use different registers for integer and FP, for instance.
making the operation dependent on the former value of the architectural destination register defeats the purpose of register renaming: write-after-read dependencies are still there.
All superscalar processors handle this by renaming the register so that the input and output use two different physical registers.
Let's take an example:
Non-masked code

mul_add.d v0, v1, v3, v4
mul_add.d v0, v5, v6, v7

Renaming maps each v0 operand to different physical registers. Assuming all inputs are ready and we have 2 FMA units with latency of 5 cycles, both instructions run in parallel and the code takes 5 cycles. So far so good.

Masked code

mul_add.d v0, v1, v3, v4, mask=v8
mul_add.d v0, v5, v6, v7, mask=v9

Renaming still maps v0 to different physical registers p0 and p1, but the second FMA now has a data dependency on the older value of v0, which is p0. Assuming a by-the-book OoO scheduler, instructions will be considered as dependent and the result will be only available after 10 cycles.

Now with the 2 µop option, the code becomes after expansion and renaming:

mul_add.d p0, v1, v3, v4
merge.d p1, v0, p0, mask=v8
mul_add.d p2, v5, v6, v7
merge.d p3, p1, p2, mask=v9

Both FMAs can start immediately and produce their results after 5 cycles. Then the first merge executes in say, 1 cycle, then the second dependent merge. Total latency is 7 cycles.
In any case, renaming was unable to completely reorder the two instructions.

I agree you could have a very smart scheduler that schedules instructions just 5 cycles before the mask becomes available so it is ready just in time. But it would add a lot of complexity in a critical part of the pipeline. Better break down the work into enough simple µops and let the out-of-order engine do the heavy lifting. Then you can spend resources you just saved into wider issue width and higher clocks.
I have not been able to get my hands on any AVX-512 processor yet, so I am not able to test it. If you have any information on the microarchitecture of these processors then please give me a link.
At some point Intel had the Knights Landing software optimization guide on their website, but it seems they took it out. :(
About masks, they say:
AVX-512 permits instructions to use masking. The best performance usually happens for instructions that do not use masking. Instructions that mask with zeroing usually have similar performance. Instructions that mask with merging can occasionally perform much slower, since there is an additional dependency introduced to the instruction. Keep in mind performance considerations when selecting the masking mode.
Their formulation suggests they follow the single-instruction approach with naive scheduling. They do not appear to split in two µops after all. But if they write "much slower" they probably mean it...

The takeaway is that the masking mode (zeroing or merging) should be available as early as possible in the pipeline, ideally at decode just by looking at the instruction. And the programmer/compiler should use zeroing whenever possible.


Author: Agner Date: 2016-08-15 23:46
Sylvain Collange wrote:
Agner wrote:
So, the total maximum number of input dependencies is five, but the maximum number of input dependencies at one stage in the pipeline can be limited to three if necessary.
The pipeline stage does not make a difference. If you want to sustain an execution rate of three 5-input instructions per clock cycle, you need the bandwidth to read 15 inputs per cycle on average from the register file. It does not matter in which order and at which pipeline stage they are read.
Could you have two separate sets of register read ports, one that is wired to the address calculation stage in the pipeline (general purpose registers only), and one that is wired to the execution stage?
Yes, x86 instructions also output flags, but x86 CPUs use tricks to turn them into single-output instructions by having slightly wider physical registers, storing the flags in the physical destination register, and renaming the architectural flags register to the physical destination of the last instruction issued.
Interesting. I wonder where you have all this information from?
Masked code

mul_add.d v0, v1, v3, v4, mask=v8
mul_add.d v0, v5, v6, v7, mask=v9

Renaming still maps v0 to different physical registers p0 and p1, but the second FMA now has a data dependency on the older value of v0, which is p0.
No. Without zeroing, the mask chooses between the result and the first input operand, which is not the destination operand.

mul_add.d v0, v1, v3, v4, mask=v5


is the equivalent of:

Code: Select all

for (i=0; i < vectorlength; i++) 
    v0[i] = v5[i] ? v1[i] + v3[i]*v4[i] : v1[i];

AVX-512 permits instructions to use masking. The best performance usually happens for instructions that do not use masking. Instructions that mask with zeroing usually have similar performance. Instructions that mask with merging can occasionally perform much slower, since there is an additional dependency introduced to the instruction. Keep in mind performance considerations when selecting the masking mode.
Their formulation suggests they follow the single-instruction approach with naive scheduling. They do not appear to split in two µops after all. But if they write "much slower" they probably mean it...

The takeaway is that the masking mode (zeroing or merging) should be available as early as possible in the pipeline, ideally at decode just by looking at the instruction. And the programmer/compiler should use zeroing whenever possible.
You may as well interpret "much slower" as a consequence of the extra dependency that it has to wait for. I actually had the masking mode (zeroing or merging) coded in a bit in the instruction in an early draft of ForwardCom, but I needed this bit for other purposes, and I was able to put it into the mask register because it doesn't change dependencies when merging with the first operand rather than with the destination.
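
The difference between the two merge conventions can be written out per element (an illustrative C++ reference with made-up names, not ForwardCom or AVX-512 code):

Code: Select all

// dst_merge shows merging with the destination; dst_first shows merging
// with the first source operand.
void merge_conventions(const double* v1, const double* v3, const double* v4,
                       const bool* mask, double* dst_merge, double* dst_first,
                       int n) {
    for (int i = 0; i < n; i++) {
        double r = v1[i] + v3[i] * v4[i];
        // Merging with the destination: the old destination value is an input,
        // so the instruction depends on the previous writer of that register.
        dst_merge[i] = mask[i] ? r : dst_merge[i];
        // Merging with the first source operand: v1 is read anyway,
        // so no extra input dependency is introduced.
        dst_first[i] = mask[i] ? r : v1[i];
    }
}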


Author: Agner Date: 2016-08-16 03:11
Actually, we could limit the maximum number of input dependencies to four if we make the rule that instructions with three input operands including a memory operand cannot have a mask. That would not be a serious limitation.


Author: Sylvain Collange Date: 2016-08-16 10:30
Correction: I meant WAW dependencies in my previous messages, not WAR dependencies.

Agner wrote:
Could you have two separate sets of register read ports, one that is wired to the address calculation stage in the pipeline (general purpose registers only), and one that is wired to the execution stage?
If both sets of ports access the same physical array of registers, then the RF component still needs m+n ports, regardless of where its ports are connected to.
If you replicate the array itself to make two copies of the register file, then yes, it works: you get twice as many read ports for about twice the area. It does not work for write ports, though: you still need to perform all writebacks to both register files to keep their data coherent. Then another complication is the bypass network, which does not scale well with the number of ports either.

You may want to take advantage of the fact that mask registers are a subset of general-purpose registers to replicate only part of the array. It would work on an in-order processor, but as soon as you rename registers this property does not hold any more.
Interesting. I wonder where you have all this information from?
Unfortunately, academic papers seldom go down to that level of detail, so you have to microbenchmark, speculate, and read patents.
For register renaming, here are the classic ones from the P6/K7 era:
https://www.google.com/patents/US6047369
https://www.google.com/patents/US5632023
Even on these 20 year-old architectures, you can see the renaming techniques are quite involved. (Thanks to the infamous INC/DEC instructions that only update a subset of the Flags register...)
No. Without zeroing, the mask chooses between the result and the first input operand, which is not the destination operand.
Now I get it, thanks! Indeed, you need no extra input since you will be reading this register anyway.
I was expecting the semantics of masked instructions to be "pretend nothing happens on lanes with a 0 bit mask". In your ISA even 0-masked lanes have the side effect of a register move. This is unusual, but why not... Wouldn't that complicate register allocation, though? Do you have code examples showing how a compiler can perform if-conversion from some arbitrary code?

Also, are there instruction encodings to choose between the result and the second operand (e.g. v0 = v5 ? v1 + v3 * v4 : v3; for a conditional Horner step)?


Author: Agner Date: 2016-08-17 07:33
Sylvain Collange wrote:
Correction: I meant WAW dependencies in my previous messages, not WAR dependencies.
Agner wrote:
Could you have two separate sets of register read ports, one that is wired to the address calculation stage in the pipeline (general purpose registers only), and one that is wired to the execution stage?
If both sets of ports access the same physical array of registers, then the RF component still needs m+n ports, regardless of where its ports are connected to.
If you replicate the array itself to make two copies of the register file, then yes, it works.
I don't want to duplicate the register file, only avoid the quadratic increase you are talking about.
Then another complication is the bypass network, which does not scale well with the number of ports either.
Can you bypass a result to multiple in-flight instructions using the same bus? If so, you need only a few bypass buses.
Interesting. I wonder where you have all this information from?
Unfortunately, academic papers seldom go down to that level of detail, so you have to microbenchmark, speculate, and read patents.
Do you have any references on how the register read ports are designed and why they are so expensive?
Even on these 20 year-old architectures, you can see the renaming techniques are quite involved. (Thanks to the infamous INC/DEC instructions that only update a subset of the Flags register...)
This is exactly the kind of cumbersome heritage that made me want to design a completely new instruction set. ForwardCom never modifies a part of a register only.
I was expecting the semantics of masked instructions to be "pretend nothing happens on lanes with a 0 bit mask".
That was never my plan. The mask value is not available at the time the register renamer assigns the output.
Do you have code examples showing how a compiler can perform if-conversion from some arbitrary code?
A branch in a vectorized loop is translated to a mask. All the good compilers can do this today.
Also are there instruction encodings to chose between the result and the second operand (e.g. v0 = v5 ? v1 + v3 * v4 : v3; for a conditional Horner step) ?
That was not my plan, but of course such an instruction could be defined if there is a need for it. But why would you have a conditional inside a polynomial?

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Merging with first operand

Post by agner » Thu Nov 02, 2017 12:20 pm

Author: Sylvain Collange Date: 2016-08-18 10:20
Agner wrote:
Can you bypass a result to multiple in-flight instructions using the same bus? If so, you need only few bypass buses.
Yes, that is essentially how it is done (or at least how it is described in textbooks). But the load on the bus increases with the number of potential consumers, and so do area, delay and power.
There are still a lot of tricks hardware designers can use on a given technology, but the general trends hold anyway. And as far as instruction set design is concerned, general trends are what we care about.
Do you have any references on how the register read ports are designed and why they are so expensive?
Rixner et al., Register Organization for Media Processing, HPCA 2000.
http://www.ai.mit.edu/projects/aries/pa ... gister.pdf
evaluates the tradeoffs using analytical models and considers various distributed and SIMD RF organizations.
Do you have code examples showing how a compiler can perform if-conversion from some arbitrary code?
A branch in a vectorized loop is translated to a mask. All the good compilers can do this today.
I was asking specifically in the context of ForwardCom and the choice of the first operand as the masked source.
e.g. Which code is generated when you vectorize :

Code: Select all

for all i {
  int x = a[i];
  int y = b[i];
  if(x > y) {
    y = x + 1;
  }
  c[i] = y;
  d[i] = x;
}
The translation in a classic predicated ISA is straightforward, but what about in ForwardCom?
I have the impression that the first operand of the masked instructions will end up being the same as the destination anyway, and that we need an extra register move.
That was not my plan, but of course such an instruction could be defined if there is a need for it. But why would you have a conditional inside a polynomial?
I do not know, but the question is what a compiler will emit when vectorizing such a program if the instruction does not exist? (And the compiler has to assume that arithmetic exceptions might be enabled.)


Author: Agner Date: 2016-08-19 12:00
Sylvain Collange wrote:
Rixner et al., Register Organization for Media Processing, HPCA 2000.
www.ai.mit.edu/projects/aries/papers/im ... gister.pdf
evaluates the tradeoffs using analytical models and considers various distributed and SIMD RF organizations.
Thanks for the reference
Which code is generated when you vectorize :

for all i {


This is straightforward: A compare instruction to generate the mask, an addition instruction, and a masked move instruction to conditionally replace y by x+1.
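
Written out per element, that sequence corresponds to something like this (a C++ sketch of the semantics, not actual compiler output):

Code: Select all

// The compare builds the mask, the add is not masked, and the masked move
// merges x+1 into y only where the mask is set.
void example_loop(const int* a, const int* b, int* c, int* d, int n) {
    for (int i = 0; i < n; i++) {
        int x = a[i];
        int y = b[i];
        bool m = x > y;      // compare instruction -> mask
        int  t = x + 1;      // addition instruction
        y = m ? t : y;       // masked move
        c[i] = y;
        d[i] = x;
    }
}
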
(And the compiler has to assume that arithmetic exceptions might be enabled.)
If you have enabled exceptions for integer overflow and you want to avoid the hypothetical overflow exception in a not-taken branch, then you have to put the mask on the addition instruction as well. In other words, you will have two masked instructions using the same mask.

This is the reason why exceptions are not good in a vector processor. In floating point code you can disable exceptions and rely on the propagation of NAN and INF values to detect numeric errors. But there is no easy way of detecting integer overflow. I have discussed this in a previous post, but there seems to be no good solution. Many compilers and programming languages simply have no easy way of detecting integer overflow, but this may be a security problem, so I thought that we must provide some method. The following methods for detecting integer overflow may be considered:
  • Use a flag output. This gives every ALU instruction an extra output dependency, and also an extra input dependency if you want the overflow bit to be propagated through a series of calculations. The extra input and output dependencies hamper out-of-order processing.
  • Add an extra overflow bit for each vector element in a vector register. This adds an extra complication for saving the overflow bits when a vector register is spilled to memory.
  • Use exception traps (interrupts) to catch integer overflow. This requires a minimum of complexity to the instruction set, but it has a problem with vector code.
  • Use the even-numbered elements in a vector register for calculation and the odd-numbered elements for propagating overflow information. This is relatively easy to implement but wasteful because only half of the ALUs are used.
Method 1 has been rejected. The other three methods are still being considered for use in ForwardCom.


Author: Sylvain Collange Date: 2016-08-19 15:37
Agner wrote:
The following methods for detecting integer overflow may be considered:
  • Use a flag output. This gives every ALU instruction an extra output dependency, and also an extra input dependency if you want the overflow bit to be propagated through a series of calculations. The extra input and output dependencies hamper out-of-order processing.
  • Add an extra overflow bit for each vector element in a vector register. This adds an extra complication for saving the overflow bits when a vector register is spilled to memory.
  • Use exception traps (interrupts) to catch integer overflow. This requires a minimum of complexity to the instruction set, but it has a problem with vector code.
  • Use the even-numbered elements in a vector register for calculation and the odd-numbered elements for propagating overflow information. This is relatively easy to implement but wasteful because only half of the ALUs are used.
Method 1 has been rejected. The other three methods are still being considered for use in ForwardCom.
What about:
5. Same as 3, and use "traditional" masking on all SIMD instructions.
In traditional masking, an SIMD instruction working on vectors of n operands takes an n-bit mask. For each lane i: If bit i is set, the operation is performed normally on lane i (including triggering exceptions as needed). If bit i is unset, nothing happens on lane i (including silencing exceptions). If multiple exceptions happen in the same vector, you may either report them all at once by exposing an exception mask, or report them one by one in a well-defined order.
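
Written per lane, the semantics are roughly the following (an illustrative C++ sketch):

Code: Select all

// A disabled lane performs no operation at all: it neither writes a result
// nor raises an exception such as divide-by-zero.
void masked_divide(const int* a, const int* b, int* r, const bool* mask, int n) {
    for (int i = 0; i < n; i++) {
        if (mask[i])
            r[i] = a[i] / b[i];   // may trap, but only on enabled lanes
        // disabled lane: nothing happens
    }
}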

AVX-512 as well as AMD and Nvidia GPUs work that way, and so do most SIMD machines since the ILLIAC-IV, 45 years ago. The only exceptions were the MMX/SSE/AVX-style short-vector instruction sets, which were not "real" SIMD (at least not until AVX-512). There is no problem with exceptions in SIMD code, as long as per-lane masking is properly defined.

There is some overhead for handling such masking in an out-of-order processor as we discussed, but zeroing can avoid this overhead in many cases.


Author: Agner Date: 2016-08-20 00:21
Thank you for your inventiveness.
Sylvain Collange wrote:

What about:
5. Same as 3, and use "traditional" masking on all SIMD instructions.
In traditional masking, an SIMD instruction working on vectors of n operands takes an n-bit mask. For each lane i: If bit i is set, the operation is performed normally on lane i (including triggering exceptions as needed). If bit i is unset, nothing happens on lane i (including silencing exceptions). If multiple exceptions happen in the same vector, you may either report them all at once by exposing an exception mask, or report them one by one in a well-defined order.
I consider this as implied in (3) because the mask in ForwardCom suppresses all error conditions for disabled elements. Maybe I didn't express this clearly enough. It is still a problem that the behavior depends on the vector length. If a loop generates two exceptions then these exceptions may happen simultaneously if vectors are long, but separately if the same code is run on another processor with shorter vectors. It is difficult to construct the exception handler so that it is guaranteed to generate the same result regardless of the vector length of the microprocessor.

Hubert Lamontagne wrote:
I can think of a couple more options:

Method 5 : Have an extra instruction that applies the operation but generates potential-overflow-or-exception flags instead of result values. This is equivalent to option 1 but using 2 instructions instead of one (but is roughly the same cost in the pipeline since you need 2 micro-ops in both case to store all the results), or it's like option 4 but without having the two results interlaced together.
Then you have a problem if you want to propagate the overflow bit. You could either check the overflow after each instruction at the cost of N branch instructions in a calculation with N steps; use OR instructions to join the overflow flags at the cost of N OR instructions; or have an extra input for the previous flag on each overflow check instruction at the cost of more input dependencies on the instructions and another dependency chain, making out-of-order execution more difficult. Anyway, the throughput will be 1/3 of the throughput without overflow checking, whereas method (4) gives 1/2 throughput.
Method 6 : Have separate vector flags registers. This is like method 1 but instead of having the extra inputs and outputs go to the same register file (which requires more ports and thus more micro-ops), they go to a separate register file that only deals with flags which means that they can have their own register renamer and have their own ports, which avoids the N^2 growth associated with adding more ports.
This still requires an extra input dependency and an extra output dependency on each instruction. I think this would be quite costly for the out-of-order scheduler.
Method 7 : Do it all in software. This makes checked operations take multiple instructions which might be slower. But it reduces the overall complexity of the design which means that you might be able to run 1 or 2 more instructions per cycle, which probably offsets the extra cost of separate checking. This can probably be combined with option 5 (just have a couple extra instructions for speeding up overflow checking for the cases that really need it).
I am not sure I understand how this differs from your method 5.
Very rough cost-benefit analysis:
Yes, this shows that all the methods are costly for the hardware as well as for the compiler. I don't want to split instructions into multiple micro-operations.

Perhaps the choice of method depends on the purpose of overflow detection:
  • If the purpose is to abort the application in case of overflow for security reasons then method (3) might be the easiest solution.
  • If the purpose is to recover from an overflow condition by re-calculating with bigger integers then method (3) might also be acceptable if overflows are rare.
  • If the purpose is to detect the carry output of unsigned addition then we may perhaps prefer method (4).
  • If the purpose is some kind of advanced compensation for signed overflow then I don't know what method is best.

Author: Hubert Lamontagne Date: 2016-08-20 18:49
Agner wrote:
Perhaps the choice of method depends on the purpose of overflow detection:
  • If the purpose is to abort the application in case of overflow for security reasons then method (3) might be the easiest solution.

This is why the add instruction triggers a fault on... IBM Z-system I *THINK*. MIPS has a separate specialized instruction for this as well. It's the same reason for 0-divide exceptions, it's used in applications where aborting failing applications is a good thing in general (like server software talking to a database where the earlier you shut it down, the less damage you inflict...). This is not too popular among end-user facing applications like video games and stuff like photoshop, because in that case an application shutdown is already the worst thing that can happen.

Agner wrote:
If the purpose is to recover from an overflow condition by re-calculating with bigger integers then method (3) might also be acceptable if overflows are rare.
If I'm not mistaken, that's how LISP did it. The catch with this one is that it requires dynamic typing to work. If you're running a language with dynamic typing, the chances that you're going to be able to run any vector code at all are low. Even for scalar code, dynamically typed languages typically have some other, much worse bottlenecks than integer computation (multiple levels of indirection to get to data in Python, for instance). For something like C++ code, IMHO it's just way easier and faster to supersize all your integers to 64 bits from the start than to deal with any sort of size change.

Agner wrote:
If the purpose is to detect the carry output of unsigned addition then we may perhaps prefer method (4).
This happens in one very precise case: BIGNUM computation. How important is BIGNUM for you? My impression on this is that BIGNUM falls kinda low on priority lists for architectures like ARM. If you design your architecture for BIGNUM, you can probably make it much more efficient for this kind of stuff than a generic architecture like MIPS. I guess it comes down to what you're targeting.

Agner wrote:
If the purpose is some kind of advanced compensation for signed overflow then I don't know what method is best.
My experience on that is that if you're running into signed overflow problems, the easiest way is simply to switch to floating point or supersize your integers to 64bits.


Author: Sylvain Collange Date: 2016-08-25 07:43
Agner wrote:
I consider this as implied in (3) because the mask in ForwardCom suppresses all error conditions for disabled elements. Maybe I didn't express this clearly enough. It is still a problem that the behavior depends on the vector length. If a loop generates two exceptions then these exceptions may happen simultaneously if vectors are long, but separately if the same code is run on another processor with shorter vectors. It is difficult to construct the exception handler so that it is guaranteed to generate the same result regardless of the vector length of the microprocessor.
I see. I agree that there are scenarios in which it is important to signal exceptions in the same order as they would occur in the original sequential program. e.g. when speculatively vectorizing a while loop with a data-dependent exit condition.
There are still solutions that are not /that/ difficult:

Option 1. In the exception handler, find out which is the first lane that produced an exception, named i (I assume the ISA gives this information somehow).
- If i=0, report the exception.
- Otherwise, remember the exception, and change the vector length for the current iteration to i-1 (either using VL or a mask register).
At the end of each iteration, check whether there is a pending exception. If there is one, report it. Otherwise, just reset the vector length and continue.

Option 2. Same approach, using hardware instead of software. The "pending exception" becomes a flag register, with 1 flag per lane. The current vector length or mask is automatically reduced as some lanes record exceptions. A new 'commit' instruction is called at the end of each iteration. It raises the recorded exception that would be the first in sequential order.

In practice, I do not expect this to be often necessary, as a speculative vectorizer will need to restore a snapshot to undo all unwanted side effects and restart the loop whenever an exception occurs anyway, regardless on which lane it occurs.


Author: Hubert Lamontagne Date: 2016-08-19 17:52
Agner wrote:
If you have enabled exceptions for integer overflow and you want to avoid the hypothetical overflow exception in a not-taken branch, then you have to put the mask on the addition instruction as well. In other words, you will have two masked instructions using the same mask.

This is the reason why exceptions are not good in a vector processor. In floating point code you can disable exceptions and rely on the propagation of NAN and INF values to detect numeric errors. But there is no easy way of detecting integer overflow. I have discussed this in a previous post, but there seems to be no good solution. Many compilers and programming languages simply have no easy way of detecting integer overflow, but this may be a security problem, so I thought that we must provide some method. The following methods for detecting integer overflow may be considered:
  • Use a flag output. This gives every ALU instruction an extra output dependency, and also an extra input dependency if you want the overflow bit to be propagated through a series of calculations. The extra input and output dependencies hamper out-of-order processing.
  • Add an extra overflow bit for each vector element in a vector register. This adds an extra complication for saving the overflow bits when a vector register is spilled to memory.
  • Use exception traps (interrupts) to catch integer overflow. This requires a minimum of complexity to the instruction set, but it has a problem with vector code.
  • Use the even-numbered elements in a vector register for calculation and the odd-numbered elements for propagating overflow information. This is relatively easy to implement but wasteful because only half of the ALUs are used.
Method 1 has been rejected. The other three methods are still being considered for use in ForwardCom.
I can think of a couple more options:

Method 5 : Have an extra instruction that applies the operation but generates potential-overflow-or-exception flags instead of result values. This is equivalent to option 1 but using 2 instructions instead of one (but is roughly the same cost in the pipeline since you need 2 micro-ops in both case to store all the results), or it's like option 4 but without having the two results interlaced together.

Method 6 : Have separate vector flags registers. This is like method 1 but instead of having the extra inputs and outputs go to the same register file (which requires more ports and thus more micro-ops), they go to a separate register file that only deals with flags which means that they can have their own register renamer and have their own ports, which avoids the N^2 growth associated with adding more ports. Plus, you can mandate that the whole flags register is always updated to avoid x86 flags register merging insanity. And you can probably have a limit of 1 op per cycle on the number of flags operations to limit the size of the extra flags manipulation block you'd be adding to the CPU (see how ARM did this: most arithmetic ops never touch the flags so it needs much less aggressive handling of flags). If you're doing heavy duty flags register operations, this might be worth the trouble - which is why 8bit and 16bit CPUs all have flags (they need it to deal with oversized integers).

Method 7 : Do it all in software. This makes checked operations take multiple instructions which might be slower. But it reduces the overall complexity of the design which means that you might be able to run 1 or 2 more instructions per cycle, which probably offsets the extra cost of separate checking. This can probably be combined with option 5 (just have a couple extra instructions for speeding up overflow checking for the cases that really need it).
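
A minimal sketch of the all-software approach, assuming a GCC/Clang-style overflow builtin (not part of ForwardCom), could look like this:

Code: Select all

#include <cstdint>

// The overflow flag is accumulated across the whole array and tested once
// at the end; the caller decides whether to report or recover.
bool add_checked(const int32_t* a, const int32_t* b, int32_t* r, int n) {
    bool overflow = false;
    for (int i = 0; i < n; i++)
        overflow |= __builtin_add_overflow(a[i], b[i], &r[i]);
    return overflow;
}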

Very rough cost-benefit analysis:
  • Requires 2 micro-ops. Note that this is fine if you have to support 2 micro-op operations for other reasons.
  • Effects depend on the ABI. If you mandate that callees don't have to save the flags and can trash them, then you only need special save/restore in interrupt handlers.
  • Depends on whether the design is in-order or out-of-order. For in-order, this makes an Arm NEON style delayed-SIMD pipeline impossible because all downstream instructions become conditional. On out-of-order, this adds an extra path from the SIMD unit to the retirement unit to signal potential interrupts (plus more complex conditional writeback of other operations retiring on the same cycle).
  • Potentially adds more shuffling of values around to add and remove space for flags. Plus, you're potentially getting half as much vector processing for a given register size.
  • Adds more ALU instructions to deal with. Added instructions are kinda quirky.
  • Requires addition of full vector flags handling unit, some instructions to load/store flag reg values, potential stalls if too many flags instructions are issued at the same time.
  • Error checked code that isn't limited by memory or branch prediction might run slower, but non-error-checked code might run faster. Simpler architecture is easier to implement and verify.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for instruction set - now on Github

Post by agner » Thu Nov 02, 2017 12:25 pm

Author: Joe Duarte Date: 2016-08-17 23:46
Agner wrote:
Joe Duarte wrote:
I think it would be interesting if applications could set a few custom constants at start time. A while back I asked about having useful constants hardwired in the CPU (e.g. π), and you guys explained why it probably wouldn't be worth it. I think letting each application tell the CPU to save a few constants that the application will often use might be more useful. The point is to be able to invoke the constant without an immediate, but just with a much smaller shortcode (two or three bits). It's kind of like an application-specific cache
ForwardCom can have constants embedded in instructions. This includes integer constants of all sizes and floating point constants of half, single and optionally double precision (requires 1, 2 and 3 word instruction size, respectively). Your idea would require an extra register set for saving and restoring data that will be needed later. This extra register set would just require two extra instructions for copying data from a normal register to an extra register and vice versa. This could be useful for many purposes and it would reduce the pressure on the data cache. The only problem I can see is that it will be an extra complication to the function calling convention because we need to know whether a function modifies the extra registers or not.
Agner, by the way, I just learned that my idea for small cache for constants (preferably in the form of special registers, or at least an L1 speed cache) has been widely implemented for years in Nvidia GPUs. It's called "constant memory", and reserves 64 KiB. See:

http://cuda-programming.blogspot.com/20 ... -cuda.html

http://stackoverflow.com/questions/1802 ... -practices

I was thinking of something smaller, like 256 bits in a reserved register for data of any type, and something like 2 or 4 KiB of stable application-specific L1-equivalent cache. The right figures should be empirically determined by careful research.

In fact, I don't think ISAs should settle on 32-bit and 64-bit types and registers unless those turn out to be optimal sizes, which I think is unlikely. As long as we can take care of alignment issues, I would determine the right type and register sizes empirically, based on what we know about common workloads and applications, as well as foreseeable future workloads. I think it's likely that 20-, 40-, and 80-bit types and registers will be more performant than 32/64. (I have no idea what the optimal vector register sizes would be, but I like 320-bit. 64-bit is such a waste when used for memory addresses and pointers. On mobile and desktop, we should be fine with a 1 terabyte address space, which 40 bits give us.)


Author: Agner Date: 2016-08-18 09:01
Joe Duarte wrote:
Agner, by the way, I just learned that my idea for small cache for constants (preferably in the form of special registers, or at least an L1 speed cache) has been widely implemented for years in Nvidia GPUs. It's called "constant memory", and reserves 64 KiB.
ForwardCom can have read-only memory areas, but they are not guaranteed to be cached. I assume that you want a small memory area that is guaranteed to always be cached. That might actually give a speed gain, since memory access is the most important bottleneck in many applications.

The Itanium did not have success with "explicit parallelism" because it was too difficult to handle for the compiler. Maybe "explicit caching" will be easier to handle. You might want to specify explicit caching for the top of the stack in the innermost loop or for often-used constants, jump tables or global variables.

There is a problem with task switching, though. The entire contents of the on-chip extra cache has to be swapped to main memory on each task switch. If you want fast task switching, you might want to have multiple cache banks with one for each process. That would be cumbersome.

Another possibility would be to reserve the extra cache for a particularly critical process. This process would monopolize the extra cache in one or more CPU cores while less important processes would be prevented from accessing it. That's an extra complication for the operating system to keep track of, but it is a quite common scenario that one process is speed-critical while all other processes are unimportant background processes.

Another possibility, which is perhaps simpler to handle, is an extra set of registers which are used for extra storage. Assume, for example, that we supplement the vector registers v0 - v31 by an extra set of vector registers w0 - w31. The extra registers can perhaps do nothing but read and write, e.g.

w8 = v5;
v2 = w8;

These extra registers can be used for often-used constants, for global variables, and for temporarily storing a normal register rather than saving it on the stack.

With an extra set of registers you have the complication of calling conventions. ForwardCom has a mechanism for telling the compiler which registers are modified by e.g. a library function. This mechanism could be extended to the extra registers. A library function could save a callee-save register to an extra register rather than saving it on the stack and reserve this extra register by indicating in the object file which extra registers it is using. The extra registers still have to be saved at every task switch if they are used.
In fact, I don't think ISAs should settle on 32-bit and 64-bit types and registers unless those turn out to be optimal sizes, which I think is unlikely. As long as we can take care of alignment issues, I would determine the right type and register sizes empirically, based on what we know about common workloads and applications, as well as foreseeable future workloads.
The alignment problem is serious. The hardware would have to support misaligned memory accesses as well as vectors of odd-size elements, which would be quite expensive for the hardware design and for performance. There is also a standardization issue.


Author: Joe Duarte Date: 2016-08-31 06:56
Agner wrote:
The Itanium did not have success with "explicit parallelism" because it was too difficult to handle for the compiler. Maybe "explicit caching" will be easier to handle. You might want to specify explicit caching for the top of the stack in the innermost loop or for often-used constants, jump tables or global variables.

Another possibility would be to reserve the extra cache for a particularly critical process. This process would monopolize the extra cache in one or more CPU cores while less important processes would be prevented from accessing it. That's an extra complication for the operating system to keep track of, but it is a quite common scenario that one process is speed-critical while all other processes are unimportant background processes.

Another possibility, which is perhaps simpler to handle, is an extra set of registers which are used for extra storage. Assume, for example, that we supplement the vector registers v0 - v31 by an extra set of vector registers w0 - w31. The extra registers can perhaps do nothing but read and write, e.g.

w8 = v5;
v2 = w8;

These extra registers can be used for often-used constants, for global variables, and for temporarily storing a normal register rather than saving it on the stack.

With an extra set of registers you have the complication of calling conventions. ForwardCom has a mechanism for telling the compiler which registers are modified by e.g. a library function. This mechanism could be extended to the extra registers. A library function could save a callee-save register to an extra register rather than saving it on the stack and reserve this extra register by indicating in the object file which extra registers it is using. The extra registers still have to be saved at every task switch if they are used.
I like your idea of the extra registers that do nothing but read/write. The key of what I want is stability and perhaps determinism in being able to cache application-specific data, constants, masks, etc. The CPU cache right now is a jungle.

I'm thinking about other ways to improve code density and decode efficiency.

1) Turn time or sequence into information. For example, maybe some instructions are only allowed at the start of a process. After a certain milestone, some instruction codes switch to represent different instructions – instructions that cannot be used at the start of a process, only after the milestone. This allows you to encode more instructions in the same 32-bit space. Are there any instructions that will never be used at the beginning of a program/process?

2) More efficient data types, beyond integers and floats. For example, I wonder if we could use a powers-of-two type, such that the literal binary value would represent an exponent of two, e.g. 10001101 would mean 2^141. Something like that would help with encoding some large numbers, though powers of two might not be the right form. Relatedly, instead of hard-coding a single constant like π the way I was thinking several months ago, there could be a type that represented a set of constants – maybe four or five bits to encode 16 to 32 predesignated constants that were empirically determined to recur frequently enough in a wide range of applications to be worth encoding. This would basically be a predefined static dictionary. It could be stuff like 32-bit pi, 64-bit pi, etc. or maybe no version of pi would make the cut. I have no idea. Do you think there are a set of a dozen or more constants that would be useful to put in a dictionary?

And I wonder about simpler, more specialized cores. Right now you've got all these powerful AVX, AVX2, BMI, FMA, TSX instructions and all the legacy x86 and 386 instructions supported on every single core. Maybe it would make sense to have a couple of Turbo cores or something that were SIMD-only, or optimized for heavy compute or HPC workloads. There's been some research on how to design a processor tailored for OpenCL, for example. Maybe ForwardCom could have a simpler, faster subset of the ISA designed for highly parallelized applications using OpenMP, OpenCL, or something like Threaded Building Blocks. It would be like Big/Little, but more specifically aimed at high performance parallel code in the little cores. A new ISA doesn't have the backwards compatibility constraints Intel is limited by, where they throw the kitchen sink into every core.


Author: Agner Date: 2016-08-31 11:08
Joe Duarte wrote:
For example, maybe some instructions are only allowed at the start of a process. After a certain milestone, some instruction codes switch to represent different instructions – instructions that cannot be used at the start of a process, only after the milestone.
The opcode numbers are not so crowded that this would be worth the extra complexity, but some code spaces are reserved for application-specific or user-defined instructions.
I wonder if we could use a powers-of-two type, such that the literal binary value would represent an exponent of two, e.g. 10001101 would mean 2^141. Something like that would help with encoding some large numbers, though powers of two might not be the right form.
ForwardCom includes formats that code constants as a << b = a · 2^b, but only for immediate constants, not for variables. It also supports half precision floating point constants - but not variables.
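
For illustration only (this is not the actual instruction encoding), such a format lets a small pair of fields represent a large constant:

Code: Select all

#include <cassert>
#include <cstdint>

int main() {
    // a = 5, b = 32 encodes the constant 5 * 2^32 without a 64-bit immediate.
    uint64_t a = 5, b = 32;
    uint64_t value = a << b;
    assert(value == 21474836480ULL);
    return 0;
}
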
Relatedly, instead of hard-coding a single constant like π the way I was thinking several months ago, there could be a type that represented a set of constants – maybe four or five bits to encode 16 to 32 predesignated constants that were empirically determined to recur frequently enough in a wide range of applications to be worth encoding.
If we have an extra set of registers, these could be used for constants that are reused in a program.


Author: Jorcy Neto Date: 2016-09-01 12:16
Joe Duarte wrote:
And I wonder about simpler, more specialized cores. Right now you've got all these powerful AVX, AVX2, BMI, FMA, TSX instructions and all the legacy x86 and 386 instructions supported on every single core. Maybe it would make sense to have a couple of Turbo cores or something that were SIMD-only, or optimized for heavy compute or HPC workloads. There's been some research on how to design a processor tailored for OpenCL, for example. Maybe ForwardCom could have a simpler, faster subset of the ISA designed for highly parallelized applications using OpenMP, OpenCL, or something like Threaded Building Blocks. It would be like Big/Little, but more specifically aimed at high performance parallel code in the little cores. A new ISA doesn't have the backwards compatibility constraints Intel is limited by, where they throw the kitchen sink into every core.
Indeed, the 1st-gen Xeon Phi (a.k.a. Knights Corner) was basically a 64-bit extended P54C core (pre-MMX Pentium) with a custom (sorta like AVX-512F) 512-bit vector ISA. And it was actually marketed as a non-MMX/SSE/AVX-compatible product.

As the current-gen Xeon Phi (Knights Landing) has adopted AVX-512 but also the whole MMX/SSE/AVX sets, an "AVX-512 only" core would surely be conceivable and even desirable in a hybrid design (like the Big/Small scheme). But I guess Intel wouldn't like to market another non-backwards-compatible CPU.


Author: Yuhong Bao Date: 2016-07-12 03:35
It is probably not difficult to create a new x86 version that is user mode compatible with most modern programs but lacks things like segmentation (including the GDT/LDT) and real mode. New OS versions would be required, but most modern user mode programs would work with few if any modifications. Anyone want to do a proposal for that?


Author: Hubert Lamontagne Date: 2016-07-12 09:11
Yuhong Bao wrote:
It is probably not difficult to create a new x86 version that is user mode compatible with most modern programs but lacks things like segmentation (including the GDT/LDT) and real mode. New OS versions would be required, but most modern user mode programs would work with few if any modifications. Anyone want to do a proposal for that?
Afaik, this is one of the things designed into UEFI (a pathway to removing 16bit/real mode and segmentation by booting straight into 32bit or 64bit mode).

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Things from MIPS (and novel things)

Post by agner » Thu Nov 02, 2017 12:27 pm

Author: Anonymous Date: 2016-07-28 07:10
Agner, have you studied the MIPS architecture? What do you think of "branch delay slots" ? Personally, I like them, because of the following:

There are things the software developer (and the compiler) easily knows, but the hardware developer will hardly know it. The instruction set should give the software developer means to express that knowledge to the hardware.

One thing software developers know is which procedure will be called on a function pointer (because of polymorphism, indirect branches have become quite popular), but the hardware couldn't possibly know it. When indirect branch instructions like "CALL (register)", "JR (register)" or "JALR (register)" are executed, the hardware is absolutely clueless as to where to move forward, because it needs to know the value of the register before proceeding (which it likely won't know; because of the modern pipelined approach, it needs the write-back value of the register). This will very likely lead to a stall.

But, with the instructions "SETCALL (register)" and "CALL (no arguments)", "SETJUMP (register)" and "JUMP (no arguments)", it's possible to execute instructions while the processor is fetching instructions from the target address, like this:

SETCALL (register) ; sets the call target, the processor starts to fetch instructions from it.
instruction one...
instruction two...
instruction n...
CALL (no arguments) ; calls the procedure pointed by the call target.

The delay slot gives the processor time to prepare for the call or jump, and relieves the branch predictor of work. The branch predictor might be completely removed from the processor with this approach (more silicon for the hardware developer!). I have never seen any processor with instructions like that (and I have studied the 68000, x86, x64, ARM and MIPS). "if" statements could also benefit from this approach.

Delay slots are not novel. The novel part here is that their size can be set arbitrarily.

I like the statements in section "10.8 Function libraries and link methods". I wonder if they are there because of the indirect branch problem.

Cache miss is also a big topic. Suppose I have five functions (functions in the mathematical sense), A, B, C, D, E, which can be fully executed in parallel. The call instructions are laid out in the following order:

CALL A; CALL B; CALL C; CALL D; CALL E;

But only the data for the E function is ready (is in the cache). Would the processor stall while waiting for the data of A, B, C, D to arrive? And, even worse, wait for the completion of A, B, C and D?

I like that vectors are present on this new instruction set and are an important part of it. Many algorithms use matrix multiplication, which can greatly benefit from instruction-level parallelism, but the current instruction sets are simply disappointing. I'm currently trying to write a fast 2048-bit modular multiplier (for exponentiation with 2048-bit exponents, which should complete in tens of milliseconds but currently takes hundreds of milliseconds), and, although the Toom-Cook algorithm uses matrix multiplication, it's simply not possible to express it nicely with the instructions given by x86 or x64.


Author: Agner Date: 2016-07-28 12:16
Anonymous wrote:
What do you think of "branch delay slots" ?
I would categorize this under the label of explicit parallelism. You are explicitly specifying what the processor should do while fetching a jump target. I think there is a problem with letting the compiler decide what to do in parallel rather than leaving this decision to the CPU. The compiler will have to know which things the processor can or cannot do in parallel and how many instructions it can execute in the time it takes to fetch the jump target. That will lock you into a particular hardware implementation. If you later develop a new processor with different timings or a different degree of parallelism then you cannot run optimally on the new processor unless you compile again for this particular processor.

We have seen this problem with the Itanium where instructions were put together in blocks of three. The decision of which three instructions to execute in parallel was left to the compiler rather than the CPU. It was very difficult to make such a compiler, and you were pretty much locked into a fixed degree of parallelism. I think it is better to have a genuine superscalar design with full out-of-order capabilities. Then you can run the same binary code on different CPUs with different amounts of parallelism, and the compiler doesn't need to know the details of the CPU.
One thing software developers know is which procedure will be called on a function pointer (because of polymorphism, indirect branches have become quite popular), but the hardware couldn't possibly know it.
In a typical application, the value of the function pointer will be available long before the values of the function parameters so that the call instruction can be executed out of order. The CPU will execute the call instruction and fetch and decode the instructions inside the function while it is waiting for the function parameters.
Cache miss is also a big topic. Suppose I have five functions (functions in the mathematical sense), A, B, C, D, E, which can be fully executed in parallel. The calls instructions are laid in the following order:

CALL A; CALL B; CALL C; CALL D; CALL E;

But only the data for the E function is ready (is on the cache). Would the processor stall, while waiting for the data of A, B, C, D arrive? And, even worse, wait for the completion of A, B, C and D?
No. Assuming that there are no code cache misses, but only data cache misses, a superscalar CPU will fetch and decode everything in advance, if the out-of-order buffer is big enough, while waiting for data. If the data for E are available before the data for the preceding functions, then E will be executed first.
I like that vectors are present on this new instruction set and are an important part of it. Many algorithms use matrix multiplication, which can greatly benefit from instruction-level parallelism, but the current instruction sets are simply disappointing.
Matrix multiplication often involves a lot of data permutation and horizontal addition which takes extra time.


Author: Hubert Lamontagne Date: 2016-07-28 13:11
Anonymous wrote:
What do you think of "branch delay slots" ?
My opinion on branch delay slots is that they're great for a certain size of CPUs: large enough to be 32 bits and maybe have a code cache (so that it's not limited by instruction memory bandwidth), but small enough that it doesn't have a branch predictor. MIPS is a perfect example of this, and tons of MIPS have been used in stuff like Playstations where they're the ideal size. Not to mention the whole "fast microcontroller" market - which is right now dominated by ARMs (which don't have a branch delay slot but would be the right kind of CPU for it). Afaik, Nvidia's Denver architecture (in-order VLIW, very modern) has branch delay slots.

Once your CPU is large enough to have a branch predictor, the speed benefit shrinks a lot. And when your CPU becomes out-of-order, the branch delay slot becomes a bit of a liability since it increases the complexity of the corner cases, and out-of-order already forces you to have a rollback mechanism anyway, so why not use it to run instructions speculatively past branches? And when your pipeline becomes longer, this means you still have stalls unless your branch delay slot is very long: for instance, if your whole issue stage is 5 cycles, and your CPU runs 3 instructions per cycle, you need a branch delay slot of 15 instructions... kinda a lot, no?

That being said, your proposal is interesting, and for a semi-ordered CPU (something in between out-of-order and in-order) would probably be a good idea, and is not too risky because even if it turns out to not be a good idea, it can still probably be implemented fairly easily in an out-of-order CPU. But I don't think I've ever seen any good examples of semi-ordered architectures - once you go past a certain complexity, it becomes simpler to implement the Tomasulo algorithm and register renaming and go full out-of-order.

Anonymous wrote:
Cache miss is also a big topic. Suppose I have five functions (functions in the mathematical sense), A, B, C, D, E, which can be fully executed in parallel. The calls instructions are laid in the following order: CALL A; CALL B; CALL C; CALL D; CALL E; But only the data for the E function is ready (is on the cache). Would the processor stall, while waiting for the data of A, B, C, D arrive? And, even worse, wait for the completion of A, B, C and D?
This is what out-of-order CPUs do! Provided that you don't exceed the limits on how many instructions the CPU can reorder and what situations it can deal with, they're designed to run at least the parts of functions further up ahead that generate memory addresses, which lets you deal with a lot more memory latency.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Matrix multiplication

Post by agner » Thu Nov 02, 2017 12:29 pm

Author: Agner Date: 2016-07-29 07:51
Anonymous wrote:
Many algorithms use matrix multiplication, which can greatly benefit from instruction-level parallelism, but the current instruction sets are simply disappointing [...] it's simply not possible to express it nicely with the instructions given by x86 or x64.
I wonder if matrix multiplication is so important that it would be worthwhile to implement special instructions that facilitate matrix multiplication. A complete matrix multiply instruction would be far too complex, but instructions for doing the necessary permutations would be realistic.

Assume that we have an NxM matrix A and an MxP matrix B and we want to calculate the NxP product AB. The matrixes are stored in row-major order. We can calculate the matrix product in the following way. Copy the first column of A into P identical columns to make an NxP matrix with all columns identical. Copy the first row of B into N identical rows to make an NxP matrix with all rows identical. Multiply the two NxP matrixes, element by element. Do the same with the second column of A and the second row of B, and so on. Now we have M matrixes of size NxP. The result is the sum of these M matrixes.

If we have a universal permute instruction then we can copy the rows or columns without problems. However, the complexity of a full permute instruction grows with the vector length squared so this may be too expensive for very long vectors. Instead, we could have a repeat-block instruction which repeats the first sub-vector into a long vector, and a repeat-within-block instruction that copies the first element of each subvector into all elements of their subvector block. These instructions can be used for repeating the rows and columns, respectively. If the vector register is not long enough to hold the entire matrix, then split the matrix into multiple registers, each holding an integral number of rows.

This will still be somewhat complicated to implement, but much cheaper than a universal permute instruction.
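To make the scheme concrete, here is a scalar sketch of the outer-product formulation described above (plain C++, for illustration only). The two broadcast comments mark where the repeat-block and repeat-within-block operations would do the same work on whole vectors:

Code: Select all

#include <vector>

// C = A * B, with A of size NxM and B of size MxP, all row-major.
// For each m: broadcast column m of A across the rows, broadcast row m of B
// down the columns, multiply element by element and accumulate the M
// partial products.
void matmul_outer_product(const std::vector<double>& A,
                          const std::vector<double>& B,
                          std::vector<double>& C,
                          int N, int M, int P) {
    C.assign(static_cast<size_t>(N) * P, 0.0);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            double a_bc = A[n * M + m];          // column m of A, repeated across the row
            for (int p = 0; p < P; ++p) {
                double b_bc = B[m * P + p];      // row m of B, repeated down the column
                C[n * P + p] += a_bc * b_bc;
            }
        }
    }
}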


Author: Hubert Lamontagne Date: 2016-07-29 09:33
ARM's simd has instructions that can help matrix multiply and are also useful for a whole bunch of other simd tasks:

VMUL, VMLA, VMLS with broadcasted operand - Vector multiply by broadcasted scalar extracted from a vector register (+accumulating and subtracting versions)
http://infocenter.arm.com/help/index.js ... JDBDJ.html

VTRN - Matrix transpose
http://infocenter.arm.com/help/index.js ... DJAEA.html

VPADD - Pairwise add
http://infocenter.arm.com/help/topic/co ... GDEGD.html

VZIP, VUZP - Interleave/deinterleave vectors
http://infocenter.arm.com/help/topic/co ... CACAB.html

VDUP - Broadcast scalar extracted from a vector register to whole vector register
http://infocenter.arm.com/help/topic/co ... DIGCC.html


Author: John D. McCalpin Date: 2016-07-29 10:03
There are two problems with matrix multiplication in recent architectures:
1. The biggest problem with matrix multiplication comes from the need to keep many independent accumulators to tolerate pipeline latency. For example, on Intel's Haswell (Xeon E5 v3) processor with two FMA units and a 5-cycle FMA latency, a minimum of 10 accumulators are required. Each of these accumulators must be a full-width SIMD vector in order to use all the FMA hardware. It turns out that 10 accumulators is not a useful number for register blocking of matrix multiplication (5x2 and 2x5 don't provide good re-use), so the optimum algorithms must use 12 accumulators -- leaving only 4 register names for everything else. The optimal blocking uses 3 registers for values that are re-used in non-adjacent code, leaving a single register for handling all of the load-with-broadcast values that are used for the input array that requires transposition. (A minimal sketch of the multi-accumulator idea follows after this list.)
2. A secondary problem is the absence of transposition support. When going from AVX2 (described above) to AVX-512, the number of accumulators required remains the same, but the wider SIMD registers means that the code must implement 8 load-with-broadcast operations (for the transposed input array) in the same number of cycles that were required for 4 load-with-broadcast operations on Haswell. The only way that this can happen is to use a larger blocking factor for register blocking so that more data re-use occurs and more cycles are available with a free load port. Since AVX2 already required all 16 registers, an AVX-512 implementation clearly requires more -- so it is not surprising that the AVX-512 instruction set provides 32 named registers.
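To illustrate point 1, here is the multi-accumulator trick in its simplest scalar form: several independent partial sums so that consecutive fused multiply-adds do not wait on each other. The count of 8 accumulators is arbitrary for the example; the text above explains why a real AVX2 kernel needs 10-12 SIMD accumulators:

Code: Select all

#include <cstddef>

// Dot product with NACC independent accumulators. Each partial sum forms
// its own dependency chain, so the FMA latency of one chain overlaps with
// the others instead of serializing every iteration.
double dot_multi_accumulator(const double* a, const double* b, size_t n) {
    constexpr size_t NACC = 8;              // illustrative; pick >= latency * FMA throughput
    double acc[NACC] = {0.0};
    size_t i = 0;
    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; ++k)
            acc[k] += a[i + k] * b[i + k];  // independent chains
    double sum = 0.0;
    for (; i < n; ++i) sum += a[i] * b[i];  // tail
    for (size_t k = 0; k < NACC; ++k) sum += acc[k];
    return sum;
}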

The first problem is probably the easiest to solve. A floating-point multiply-add unit with a single-cycle multiply-accumulate mode is described by S. Jain et al., in “A 90mW/GFlop 3.4GHz reconfigurable fused/continuous multiply-accumulator for floating-point and integer operands in 65nm,” VLSID ’10., 2010. This would reduce the number of accumulators to 2 (one per FMA unit) and (in the AVX2 example) would free 10 registers for holding data for re-use.

The second problem is definitely more difficult to solve. Although it would be relatively easy to build an FMA unit with a built-in broadcast of a scalar element, this would not help with getting the scalar element(s) into the register(s) in the first place. In the standard form of matrix multiplication, the values that need to be broadcast are collected from strided memory addresses (with a stride equal to the dimension of the matrix). Most (all?) recent processors have "wide" (multiple 64-bit word) interfaces at all levels of the cache/memory hierarchy, so there is physically no way to get full bandwidth while extracting only one element per block.

A possible route forward would be a SIMD "transpose with broadcast" instruction, taking a single vector of P contiguous elements and broadcasting each element across one of P output SIMD registers. The overall matrix multiplication algorithm would have to be organized so that all of these P elements would be used before being discarded, which restricts the space of possible orderings considerably. I have not worked through the details to see if this leads to a workable solution. A non-SIMD approach based on independent single-lane vector units operating from a scratchpad memory (rather than SIMD registers) might be preferred, but that opens up too many options to think about. Some encouraging developments in high-radix on-chip crossbar technology that would help in any of these approaches to parallelism are reported by Sudhir Satpathy, et al., in a number of publications, including
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf


Author: Agner Date: 2016-07-29 11:03
John D. McCalpin wrote:
A possible route forward would be a SIMD "transpose with broadcast" instruction, taking a single vector of P contiguous elements and broadcasting each element across one of P output SIMD registers.
Hubert argued that multiple outputs is a bad thing for an out-of-order CPU. Consequently, I made the ForwardCom design so that it allows only one output register for each instruction. But it allows very long vector registers. I am thinking about the situation where the vector register is long enough to contain the entire matrix, or at least several rows of it. What I proposed is the same as you described, except that it is all contained in a single vector register.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Introduction website

Post by agner » Thu Nov 02, 2017 12:32 pm

Author: Agner Date: 2016-08-01 01:48
I have made an introduction to the ForwardCom project at http://www.forwardcom.info.

I have added a few optional instructions to facilitate matrix multiplication.


Author: EricTL Date: 2017-07-17 18:15
Great proposal so far with a lot of interesting ideas.
May I suggest a couple of instructions:
-Bit reverse: Reverse all the bits of a register.
This is very useful to enable bit hacks on the most significant bits.

-LUT-based logical ops: Rd, Rs, Imm: Rd OPS Rs -> Rd, where OPS is an 8-bit LUT implementing any three-input, one-output logical expression.
4 more bits of the immediate field will specify msb, lsb and the following 3 cases:
OPS(Rd, msb:Rd >> 1, Rs) -> Rd
OPS(Rd, Rd:lsb << 1, Rs) -> Rd
OPS(Rd:lsb << 1, msb:Rd >> 1, Rs) -> Rd
That instruction could replace all the other logical instructions (And, Or, and-not, ...) and adds some new ones.

-MultiCarryOps:
Rd ops Rs under Rc (carry mask): ops can be +, -, >>, <<, rot. The carry mask works as follows:
Each bit set in Rc indicates the msb of a sub-bit-string, i.e. the carry does not propagate to the bit left of a set bit in the carry mask.
The idea is to enable efficient bitfield ops.
A compressed version could force a bit field to be at least two bits and offer Rd ops Rs, where Rs 'or's together the carry mask and a single bit after each Cm bit set.
e.g. Cm = 0x0420: 0x7C1F + 0x0631 => 0x8020; 0x7C1F - 0x0631 => 0x77FE; 0x7C00 - 0x0230 => 0x7FE0
These instructions will enable a better use of bit-fields and enable more complex loop counts in one instruction. Also, if made atomic, they could enable new lock-free algorithms.


Author: Agner Date: 2017-07-18 00:28
EricTL wrote:
May I suggest a couple of instructions:
-Bit reverse: Reverse all the bits of a register.
This instruction exists. It is called bit_reverse.
LUT base logical ops:
Instruction truth_tab2 is a two-input LUT.
Instruction truth_tab3 is a three-input LUT.

These instructions implement arbitrary functions for boolean vectors. However, the function is applied only to the least significant bit of each vector element. There are two reasons for this limitation. (1) This is the way boolean vectors are represented in ForwardCom. The boolean value is stored in the least significant bit. The remaining bits are option bits, which can be propagated to the result and used for specifying options to e.g. a floating point instruction using the LUT result as a mask. (2) The 3-input LUT can be implemented in hardware by using an 8-bit barrel shifter where the 8 bits of the LUT are shifted right by the count defined by the three input bits. The hardware already has an 8-bit barrel shifter for each 8-bit vector element so that the implementation is cheap. A LUT function that applies to all bits of a vector will require one 8-bit barrel shifter for each vector bit, which would be more expensive in terms of silicon space.
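In pseudocode, the per-element operation is essentially a table lookup indexed by the three boolean inputs. The sketch below is only an illustration; the ordering of the index bits and the zeroing of the upper result bits are assumptions for the example, not a specification:

Code: Select all

#include <cstdint>

// One element of a 3-input boolean LUT operation: bit 0 of each input
// selects one of the 8 bits of the truth table, which becomes bit 0 of
// the result (upper bits simply zeroed in this sketch).
uint8_t truth_tab3_element(uint8_t lut8, uint8_t a, uint8_t b, uint8_t c) {
    unsigned index = ((c & 1u) << 2) | ((b & 1u) << 1) | (a & 1u);
    return (lut8 >> index) & 1u;   // the "8-bit barrel shifter" view
}
// Example: lut8 = 0xE8 (11101000) selects the majority function of (a, b, c).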

I am not sure I understand what you want to do with the msb, lsb inputs?
MultiCarryOps:
Rd ops Rs under Rc (Carry mask): Ops can be +, - >>, <<, rot The carry mask work as follow
Each bit set in the Rc indicate the msb of a sub bit string. i.e. the carry does not propagate to the left bit of a set bit in the carry mask.
Interesting idea. But this has no high level language equivalent. It will only be used by programmers who master the use of intrinsic functions or assembly language. A simpler solution is to just leave one unused bit between each bitfield to avoid carry from one bitfield to the next.

BTW, I have made the move_bits instruction for efficient insertion and extraction of bit fields.


Author: EricTL Date: 2017-07-20 19:19
I thought I had replied already. Maybe I did not send it.

Agner wrote:
Instruction truth_tab2 is a two-input LUT.
Instruction truth_tab3 is a three-input LUT.

These instructions implement arbitrary functions for boolean vectors. However, the function is applied only to the least significant bit of each vector element. There are two reasons for this limitation. (1) This is the way boolean vectors are represented in ForwardCom. The boolean value is stored in the least significant bit. The remaining bits are option bits, which can be propagated to the result and used for specifying options to e.g. a floating point instruction using the LUT result as a mask. (2) The 3-input LUT can be implemented in hardware by using an 8-bit barrel shifter where the 8 bits of the LUT are shifted right by the count defined by the three input bits. The hardware allready has an 8-bit barrel shifter for each 8-bit vector element so that the implementation is cheap. A LUT function that applies to all bits of a vector will require one 8-bit barrel shifter for each vector bit, which would be more expensive in terms of silicon space.

I am not sure I understand what you want to do with the msb, lsb inputs?
Let's put aside the shift left or right by one (the immediate lsb and msb were there to feed the shift on both ends of the register).

A 4-bit LUT can be made with 3 multiplexers (mux.) per bit.

I contend the logical part of the ALU will need at least 3 mux. per bit to choose from A | B, A & B, A & ~B and A ^ B.
So in essence, for a logic-encoded instruction the 2 mux control bits come from the instruction, while for the LUT-based one the two control bits come from the registers.

In the simplest implementation the 4 logical ops will be done first then the mux4:1 will choose which one.

In the LUT case while the registers are being read the immediate 4 bits for the LUT are amplified and broadcasted to the mux4:1.

In modern CMOS on SOI a mux2:1 is just 2 transistors. So if one counts two inverters (2 Tr x 2) for every 8 bits => 4 x 4 x (64 / 8) = 128 Tr for the broadcast
plus 64 x 6 Tr for the mux4:1

Total 512 Tr for the universal solutions vs a minimum of 64 x (6+1) for the mux4:1 part plus the cost of the four logical ops...
Granted it is oversimplified, but I guess anyone can get the gist of it.

Moreover, with the LUT solution one also gets Not, Clear, Id (move) plus any other less common logical expressions,
and in a two-register instruction setting (Rd = Rd ops Rs) the LUT enables both (Rd := Rd ops Rs) and (Rd := Rs ops Rd).

The case for an 8-bit LUT (universal three-input logical operation) is less clear. The mux is now 8:1 => 14 Tr and the broadcast doubles to 256 Tr => 1152 Tr.
I still think that instruction can be leveraged by a compiler to, at the very least, make the code smaller.

To finish on a rather hackish and somewhat silly note: in the case of the tiny instruction format the 4-bit LUT can be implemented by using the code for Rs as both the register and the LUT. In that form Clear can be implemented as LUT R0, therefore reclaiming the opcode.

The Not opcode can be reclaimed as well by LUT R5 (LUT convention b3,b2,b1,b0 -> 0101),
freeing an opcode for something else. A clever compiler should be able to play with the limitation somewhat and get some further optimization.


Author: Agner Date: 2017-07-20 23:27
I don't like the idea of using the same bitfield as both instruction and register. You will not have a free choice of what register to use, and the out-of-order scheduler will have problems detecting whether it should wait for the value of that register or not.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for instruction set - now on Github

Post by agner » Thu Nov 02, 2017 12:41 pm

Author: Joe Duarte Date: 2016-08-04 01:23
Agner, there are a couple of specialized instruction domains that you seem to be pushing to FPGAs or other coprocessors, but I think ForwardCom will suffer from their exclusion.

Encryption may be specialized, but it's not optional anymore. Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage. You could say that FPGAs can do it, but computers today are expected to be able to do this without extra silicon. FPGAs are not in anyone's Bill of Materials for a smartphone, PC, or server (well, in rare cases for servers), and there's still a lot of doubt over whether there are enough programmers who know how to fully leverage FPGAs.

Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either. I know that programs rarely use those SSE 4.2 instructions today, but that sort of thing will become more common in the future as there is more pressure on energy use in cloud computing and battery life on mobile. At some point someone is going to build a clean-sheet web server that fits entirely in the L2/L3 cache.

Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?


Author: Agner Date: 2016-08-04 14:37
Joe Duarte wrote:
Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage.


You are probably right that AES instructions would be useful.
Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either.
Text strings are rarely so long that it really matters. It will take less time to parse a string than to read it from an SSD anyway.

x86 AES and SSE4.2 instructions are still only 128 bits while most other vector instructions are 256 - and soon 512 - bits. This is because they are costly to implement. In most cases, it will be faster to parse a string using simple compare instructions with the maximum vector length than with an SSE4.2 instruction limited to 128 bits.
Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?
Instruction fetch and decoding is a bottleneck on all Intel processors. They are putting a micro-operation cache after the decoder to fix this, but each micro-op uses many bits so physical distances on the silicon chip are a limitation. The ForwardCom instruction format will be more efficient than this because it can decode fast and it is compact. x86 has a lot of suboptimal designs because it needs to preserve backwards compatibility with a long legacy of short-sighted designs and patches. They even keep supporting undocumented opcodes from the old 8086.

It is difficult to predict how much you can gain from having longer vectors. Some applications will be unable to use long vectors. The special vector loop will be an advantage whenever the loop count is not certain to be a multiple of the vector size.
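Conceptually, the vector loop does something like this; the C++ below is only meant to illustrate the idea of shortening the vector on the last pass, it is not ForwardCom syntax:

Code: Select all

#include <algorithm>
#include <cstddef>

// Vector-length-agnostic loop: each pass handles up to max_vector_len
// elements, and the final pass simply uses whatever length remains,
// so no separate scalar remainder loop is needed.
void scale_array(float* data, size_t n, float factor, size_t max_vector_len) {
    for (size_t done = 0; done < n; ) {
        size_t len = std::min(max_vector_len, n - done);  // shorter on the last pass
        for (size_t i = 0; i < len; ++i)                  // stands in for one length-limited vector op
            data[done + i] *= factor;
        done += len;
    }
}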


Author: Hubert Lamontagne Date: 2016-08-05 12:25
Joe Duarte wrote:
Agner, there are a couple of specialized instruction domains that you seem to be pushing to FPGAs or other coprocessors, but I think ForwardCom will suffer from their exclusion.

Encryption may be specialized, but it's not optional anymore. Fast encryption is going to be a central requirement of any computing device for decades. So the lack of AES and similar instructions is a disadvantage. You could say that FPGAs can do it, but computers today are expected to be able to do this without extra silicon. FPGAs are not in anyone's Bill of Materials for a smartphone, PC, or server (well, in rare cases for servers), and there's still a lot of doubt over whether there are enough programmers who know how to fully leverage FPGAs.
What about a specialized AES encryption device, accessed using memory-mapped I/O like other peripherals such as DMA controllers, timers etc.? Wouldn't that do the job, and keep the overall complexity lower? The OS would probably have to make sure no 2 programs are using the AES device at the same time, but otherwise it would be pretty straightforward: set the key, push the data through. Afaik that's close to how ARM does it.

Joe Duarte wrote:
Parsing text and bytecode is a huge part of what computers do every day, so I think a new ISA should take a clean look at how to speed up parsing (and JIT'ing). You don't have any counterpart to the SSE 4.2 string comparisons, and I wouldn't want to punt that to FPGAs either. I know that programs rarely use those SSE 4.2 instructions today, but that sort of thing will become more common in the future as there is more pressure on energy use in cloud computing and battery life on mobile. At some point someone is going to build a clean-sheet web server that fits entirely in the L2/L3 cache.
Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature. Afaik, there are only 2 approaches for this at the moment:

- If your load is to run a whole bunch of programs at the same time (= it's a server), you can use the UltraSparc design: a really simple in-order CPU that uses tons of hyper-threading, and every time something stalls just switch to the next thread. The simple in-order core takes a lot less space so you can cram a whole bunch of them on the same chip, which lets you compensate for speed deficiencies.

- Otherwise, just use a general purpose out-of-order CPU that's as fast as possible. x86 fits this profile perfectly (which is why it's so popular still) and ARM is evolving towards that too. This is part of why RISC is popular: starting with a classic 4-issue out-of-order CPU like the Dec Alpha 21264, it's very hard to find new instructions to add that will actually speed it up at text processing and not be overly specific, and afaik the stuff that speeds it up the most is just making branch prediction and data cache as fast as possible (look at x86 evolution for a perfect example of this). Truth is, if your text processing program has the kind of stuff that doesn't run fast on x86, chances are it's doing stuff like defeating the branch predictor or data caching, and it's probably not going to run fast on ANY hardware. In fact, there's even a good chance that it's not even filling the 4-instructions-per-cycle limit that modern Pentium-2 descendant cores can do.

Bytecode interpreters fall into a similar category of load-store-jump code and are hard to speed up for similar reasons. A way to put it is that while you can use parallelism to increase the number of calculations per second, you can't use it to increase the number of decisions per second, which tends to be limited by electrical constraints (clock rate, number of cycles of latency on the data cache, number of cycles of latency on issue - which can be reduced with a simpler instruction encoding but not past a certain limit).
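A stripped-down dispatch loop shows the problem: nearly every iteration is a load followed by a data-dependent indirect branch, which is exactly the "decisions per second" limit described above. The toy opcode set here is invented for the example:

Code: Select all

#include <cstdint>
#include <vector>

// Minimal bytecode interpreter: the hot path is load opcode -> indirect
// branch -> a little work -> branch back, so throughput is bounded by
// branch and load latency rather than by arithmetic.
enum Op : uint8_t { OP_PUSH, OP_ADD, OP_HALT };

int64_t run(const std::vector<uint8_t>& code) {
    std::vector<int64_t> stack;
    size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {                      // data-dependent branch every iteration
        case OP_PUSH: stack.push_back(code[pc++]); break;
        case OP_ADD: {
            int64_t top = stack.back(); stack.pop_back();
            stack.back() += top;
            break;
        }
        case OP_HALT: return stack.empty() ? 0 : stack.back();
        }
    }
}
// run({OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}) returns 5.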

I guess there's the special exception of the CPU designed to use dynamic translation - like the Transmeta Crusoe or NVidia's Denver (which in theory the Nintendo NX could use). The catch is that even then, these cores tend to perform good on loopy-mathy code, but not very well on loady-story-jumpy code (which is why the Denver performs better than ARM cores on benchmarks but worse in real world applications), and bytecode tends to be used for loady-story-jumpy code. It also doesn't help that high level programming languages tend to have all sorts of features that make the language inherently slow (one example that comes to mind is Python's data structures that I think use pointers to pointers just for getting to an object's data) and hard to dynamically compile (because they allow dynamic typing hacks, which makes it hard to protect the interpreter against corner cases).

Joe Duarte wrote:
Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?
That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).

Imho most of the ideas and innovation and potential speed gain in ForwardCom is in the vector instructions. This means that it has good potential for numeric-heavy applications (video games! physics!), if you use vectors, which generally means you have to program in C++ and use intrinsics (or use assembly). A few C++ compilers can auto-vectorize (Intel's compiler, MSVC 2012 and later and GCC with all speed options turned on) but it has to be "trivial loops" - if any of the memory accesses alias any other memory access in the loop, or there's any sort of carried state, then it's impossible to vectorize and the compiler will issue normal instructions (ie scalar integers and floats). Basically the compiler has to prove that each loop iteration is 100% independent from other iterations, which generally only happens in numeric-heavy code looping over a buffer and applying exactly the same mathematical operation on each iteration and data coming in neat linear arrays.
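For example, the first loop below is the kind of "trivial loop" that auto-vectorizes, while the second carries state from one iteration to the next and generally stays scalar (__restrict is a common compiler extension used here to promise that the pointers don't alias):

Code: Select all

#include <cstddef>

// Trivially vectorizable: every iteration is independent and the
// accesses are linear, with no aliasing between x and y.
void saxpy(float* __restrict y, const float* __restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Not vectorizable as written: each iteration depends on the previous
// one (loop-carried state), so the compiler emits scalar code.
float filter(float* y, const float* x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s = s * 0.5f + x[i];   // carried dependency
        y[i] = s;
    }
    return s;
}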

For higher level languages (anything dynamically typed like JavaScript or Python etc.), those are not designed for vectorization in any way so they would probably run 100% scalar. For middle level languages (Java, C#), those are in theory vectorizable but only in the same cases that C++ is vectorizable (trivial loops doing numerical computation), and generally people use C++ (video games) or Fortran (scientific computing) for those applications so there's not really any point AFAIK.


Author: Agner Date: 2016-08-06 02:44
Hubert Lamontagne wrote:
What about a specialized AES encryption device, accessed using memory mapped I/O like other peripherals like DMA controllers, timers etc? Wouldn't that do the job, and keep the overall complexity lower? The OS would probably have to make sure no 2 programs are using the AES device at the same time but otherwise it would be pretty straightforwards: set the key, push the data through. Afaik that's close to how ARM does it.
That's a pretty expensive solution. I wouldn't recommend a clone of the x86 AES instructions either, because they have long latencies which is bad for the smooth pipeline structure of ForwardCom. I would prefer to split up the AES operations into small unit operations with single-clock latency. If the length of vector registers is at least 2048 bits and we have full byte permute instructions then you can have a complete 256 byte lookup table contained in one vector register and do the table lookup with a byte permute instruction. The whole AES algorithm could then be implemented with general permute and XOR instructions without any special-purpose instructions. This would be a flexible solution that could be adapted to other encryption algorithms as well. The price is a large permutation circuit.
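The per-byte work in such an algorithm is essentially a 256-entry table lookup plus XOR, as sketched below. With a vector register long enough to hold the whole table, the lookup loop collapses into a single byte permute and the key mix into one XOR. The table argument stands in for an S-box; it is not the real AES S-box:

Code: Select all

#include <cstddef>
#include <cstdint>

// One substitution + key-mix step over a block of bytes. On a machine
// with a full byte permute over a vector register of at least 256 bytes,
// the table lookup becomes one permute instruction and the key mix one
// XOR, with no special AES instruction needed.
void sub_and_xor(uint8_t* state, const uint8_t* round_key, size_t len,
                 const uint8_t table[256]) {
    for (size_t i = 0; i < len; ++i)
        state[i] = table[state[i]] ^ round_key[i];
}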
Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature.
The branch misprediction penalty is equal to the length of the pipeline. The best way to speed up unpredictable branches is to have a short pipeline or to speculatively execute both sides of a 2-way branch simultaneously. Some text processing jobs can be vectorized, using masked operations etc. UTF-8 conversion can be done with the compress_sparse and expand_sparse instructions in ForwardCom (sorry, they are poorly described in the manual).
Bytecode interpreters fall into a similar category of load-store-jump code and are hard to speed up for similar reasons. A way to put it is that while you can use parallelism to increase the number of calculations per second, you can't use it to increase the number of decisions per second, which tends to be limited by electrical constraints
ForwardCom has a multiway branch instruction which is useful for byte code interpreters, but I think the way to speed up byte code is JIT compilation.

Joe Duarte wrote:
Have you been able to guess at performance improvements from the new ISA? Holding everything else equal, the foundry technology, etc. would ForwardCom be faster or more efficient than a comparable Broadwell-E? I'm thinking of your new loops, for example, and other features. What sense do you have of the performance difference?
Hubert Lamontagne wrote:
That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core.
The target of ForwardCom is indeed to beat the performance/price ratio of existing systems. It is designed with a simple and consistent pipeline structure. It is doing one thing taking one clock cycle at each pipeline stage, except for multiplication, division, and a few other necessary things.

We should not forget, however, that in many applications the bottleneck is not the CPU but cache, memory access, disk access, or network speed.


Author: fanoI Date: 2016-08-08 06:00
Hello Agner,

what about a hardware-accelerated decimal floating point unit as present in IBM Power CPUs? In particular, for business applications the decimal type is necessary, but being done in software it is now 100x slower than float / double!


Author: Agner Date: 2016-08-08 10:26
fanoI wrote:
what about an hardware accelerated Decimal Floating Point Unit
The x86 instruction set has instructions for integer decimal arithmetic. However, these instructions were removed in 64-bit mode because they are never used.

The normal procedure is to convert decimal to binary, do calculations in binary, and convert the result to decimal. These conversions can be vectorized to speed up the process. You can find code examples for this in my vector class library. Anyway, decimal numbers are for human use and the bottleneck will always be the human, not the computer :-)


Author: fanoI Date: 2016-08-09 13:51
Agner wrote:
fanoI wrote:
what about an hardware accelerated Decimal Floating Point Unit
The x86 instruction set has instructions for integer decimal arithmetic. However, these instructions were removed in 64-bit mode because they are never used.

The normal procedure is to convert decimal to binary, do calculations in binary, and convert the result to decimal. These conversions can be vectorized to speed up the process. You can find code examples for this in my vector class library. Anyway, decimal numbers are for human use and the bottleneck will always be the human, not the computer :-)
I think the instructions you are referring to were for the old BCD mode. I was referring to actual decimal floating point:
https://en.wikipedia.org/wiki/Decimal_floating_point

they exist in C as GCC extensions (_Decimal32, _Decimal64 and _Decimal128), in C++ as decimal32, decimal64 and decimal128, and in C# there is the Decimal type. To do calculations with these types on an x86 CPU you need to resort to doing them in software (in Power CPUs IBM has added a decimal floating point unit to do this in hardware). They are used for financial calculations where powers of 10 are more important than powers of 2 :-)


Author: Joe Duarte Date: 2016-08-08 18:41
Hubert Lamontagne wrote:
Text parsing is hard to speed up. It tends to be made up of loops full of loads, stores, and conditional branches, and not very many of the kind of mathematical operations that can be sped up. It's inherently serial in nature. Afaik, there are only 2 approaches for this at the moment:

- If your load is to run a whole bunch of programs at the same time (= it's a server), you can use the UltraSparc design: a really simple in-order CPU that uses tons of hyper-threading, and every time something stalls just switch to the next thread. The simple in-order core takes a lot less space so you can cram a whole bunch of them on the same chip, which lets you compensate for speed deficiencies.

- Otherwise, just use a general purpose out-of-order CPU that's as fast as possible. x86 fits this profile perfectly (which is why it's so popular still) and ARM is evolving towards that too. This is part of why RISC is popular: starting with a classic 4-issue out-of-order CPU like the Dec Alpha 21264, it's very hard to find new instructions to add that will actually speed it up at text processing and not be overly specific, and afaik the stuff that speed it up the most are just making branch prediction and data cache as fast as possible (look at x86 evolution for a perfect example of this).
I'm thinking of approaches like Parabix: http://parabix.costar.sfu.ca/
One of their papers: https://www.cs.sfu.ca/~ashriram/publica ... arabix.pdf

I guess it just comes down to vector instructions and their parallel bitstreams approach. (Another way to boost it would be multiplexed streams, but that's a different plane of the architecture than the ISA.)

Hubert wrote:
That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).
Well, Broadwell-E is a high target right now, but the target is moving. It's going to be a low target in 2020. Of course Agner's not necessarily thinking in terms of marketable silicon, but more toward a useful reference and engineering exercise. I'm just wondering what kind of performance wins are possible. Another way of framing it: If a well-funded team rolled up their sleeves and built a new general purpose CPU architecture from scratch, including a new ISA and tape out and all that, using TSMC or Samsung's 16/14 nm process, could they beat Intel/Skylake? Maybe Intel is wringing all the performance out of current processes that the physics allows, and the legacy technical debt is only a design cost overhead. They seem to be treading water since Haswell, so I sometimes wonder if a better architecture would make a difference or if the wall is more fundamental than Intel.

By the way, do you think Itanium was just too early, with too little tooling? Does an out of order CISC have any inherent advantage over VLIW or EPIC? It seems goofy to send one instruction at a time. I haven't found any deep-dive post-mortems on Itanium, just vague claims that no one could build a good compiler for it, or that the physical product just wasn't that fast. I think we're going to see some big advances in compiler technology in the near future, with powerful platforms like IBM Watson doing things a laptop-bound compiler can't do.

By the way, regarding the idea of specialized AES encryption device, Intel offered QuickAssist on the v2 and v3 Xeons: http://www.intel.com/content/www/us/en/ ... rview.html

They don't seem to offer it on the Broadwells. It's a dedicated accelerator for encryption and compression, but I've never found a detailed review anywhere. Do either of you know more about it? I'm surprised it didn't get more attention.


Author: Hubert Lamontagne Date: 2016-08-09 17:49
Joe Duarte wrote:
I'm thinking of approaches Parabix: http://parabix.costar.sfu.ca/
One of their papers: https://www.cs.sfu.ca/~ashriram/publica ... arabix.pdf

I guess it just comes down to vector instructions and their parallel bitstreams approach. (Another way to boost it would be multiplexed
streams, but that's a different plane of the architecture than the ISA.)

Hmm, that's an interesting technique. Very reminiscent of how they used to do graphics on the Amiga and in EGA (which worked well for 2d but had to be abandoned once they started doing 3d).

Joe Duarte wrote:

Hubert wrote:
That's a pretty high target! To get anything close to this performance, you'd need to run ForwardCom at about 4 instructions per cycle, and considering the complexity of ForwardCom instructions, this would require a pretty large and complex CPU core. My hunch is that it would run at about the same speed as x86, but simply require less design time to implement all the crazy instruction unpacking and 286-era segmenting intricacies. I don't think ForwardCom would scale very well past 4 instructions per cycle, but to be fair I don't think any instruction set does (they tried to do it with Itanium and failed).
Well, Broadwell-E is a high target right now, but the target is moving. It's going to be a low target in 2020. Of course Agner's not necessarily thinking in terms of marketable silicon, but more toward a useful reference and engineering exercise. I'm just wondering what kind of performance wins are possible. Another way of framing it: If a well-funded team rolled up their sleeves and built a new general purpose CPU architecture from scratch, including a new ISA and tape out and all that, using TSMC or Samsung's 16/14 nm process, could they beat Intel/Skylake? Maybe Intel is wringing all the performance out of current processes that the physics allows, and the legacy technical debt is only a design cost overhead. They seem to be treading water since Haswell, so I sometimes wonder if a better architecture would make a difference or if the wall is more fundamental than Intel.
That's a good question. I think that's exactly what ARM is trying to do! In fact, there are a whole bunch of ARM licensees that are taking a stab at this, and imho it's proving to be a little bit harder than expected...

The catch is that once you have a cpu as big as a Broadwell, it has tons of stuff in it: in particular, a very complex memory architecture with a whole laundry list of features: aggressively pipelined multibank data cache with 2 read ports and 1 write port, TLB, out-of-order non-blocking everything with a memory ordering buffer, prefetching, multi-core cache coherency protocol, handling of misaligned addresses... This is basically almost totally independent of the architecture - you could implement all the same stuff on an ARM or MIPS or POWER core and it would be just as complex.

If you factor in the other stuff like aggressively pipelined secondary units like the FPU and SIMD unit, heavy branch prediction, hyper threading... this is all stuff that's basically the same on other architectures. The only thing that x86 changes is the front end. The front end part of a CPU isn't THAT large and complex compared to all the rest on modern cores with so many features.

And if, say, some new silicon process changes the distribution of bottlenecks so that faster instruction decoding becomes more important again, then Intel can simply put in a larger or broader micro-code cache, or maybe even go back to the instruction trace cache like on the Pentium 4. (ok, the P4 is a bit reviled, but the truth is that it still held on pretty good compared to the other well designed architectures of the time)


Joe Duarte wrote:
By the way, do you think Itanium was just too early, with too little tooling? Does an out of order CISC have any inherent advantage over VLIW or EPIC? It seems goofy to send one instruction at a time. I haven't found any deep-dive post-mortems on Itanium, just vague claims that no one could build a good compiler for it, or that the physical product just wasn't that fast. I think we're going to see some big advances in compiler technology in the near future, with powerful platforms like IBM Watson doing things a laptop-bound compiler can't do.
I think it boils down to a couple things... The first generation Itanium came out late. Because of this, RISCs and x86s of the day had an extra generation to catch up, which means that the first generation Itanium totally failed to impress. Second generation came out pretty fast and sorta fixed this, but Itanium could never quite reach up to the speed of the fastest RISCs and x86s. One guy said that it was never as fast as HP's previous architecture (PA-RISC, totally a classic RISC).

Second thing, Itanium was basically betting that out-of-order would fail to deliver enough speed gains to keep the RISCs and x86s relevant. It turns out that out-of-order delivers speed gains at just the right place: memory accesses. Real life programs have lots of memory accesses going on, which gives out-of-order a fairly hard-to-beat edge. It's just easier to deal with memory latency with an out-of-order pipeline than with the Itanium's insane ALAT thing with checked speculative loads and stores and whatnot. Late Itaniums can issue 12 instructions per cycle, which x86 will probably never be able to do, but even though x86 only issues 4 instructions per cycle, 3 of those can have a built-in memory access which can be crazy-reordered by the aggressive out-of-order pipeline, and it turns out that that's enough to smoke the Itanium.

There's a question on Quora about exactly this and it has tons of interesting answers:
Why did the Intel Itanium microprocessors fail? https://www.quora.com/Why-did-the-Intel ... ssors-fail


Author: Joe Duarte Date: 2016-08-11 19:09
Hubert Lamontagne wrote:
The catch is that once you have a cpu as big as a Broadwell, it has tons of stuff in it: in particular, a very complex memory architecture with a whole laundry list of features: aggressively pipelined multibank data cache with 2 read ports and 1 write port, TLB, out-of-order non-blocking everything with a memory ordering buffer, prefetching, multi-core cache coherency protocol, handling of misaligned addresses... This is basically almost totally independent of the architecture - you could implement all the same stuff on an ARM or MIPS or POWER core and it would be just as complex.

If you factor in the other stuff like aggressively pipelined secondary units like the FPU and SIMD unit, heavy branch prediction, hyper threading... this is all stuff that's basically the same on other architectures. The only thing that x86 changes is the front end. The front end part of a CPU isn't THAT large and complex compared to all the rest on modern cores with so many features.

And if, say, some new silicon process changes the distribution of bottlenecks so that faster instruction decoding becomes more important again, then Intel can simply put in a larger or broader micro-code cache, or maybe even go back to the instruction trace cache like on the Pentium 4. (ok, the P4 is a bit reviled, but the truth is that it still held on pretty good compared to the other well designed architectures of the time)
By the way, this point and your other point on dynamic translation, Transmeta, etc. remind me of the recent buzz around Soft Machines. I don't think I've seen any discussions of Soft Machines here. In particular, I was reminded of Ian Cutress saying in this article that Soft Machines was more than ten years behind Intel on speculation and branch prediction: http://www.anandtech.com/show/10025/exa ... e-visc-ipc

That really struck me as illustrating how much accrued work and research has gone into a modern Intel CPU, and that an ISA might be a relatively minor piece.

While I have you, what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions are constantly added, and that they consistently speed things up.


Author: Agner Date: 2016-08-12 00:11
Joe Duarte wrote:
what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions are constantly added, and that they consistently speed things up.
My idea is not to use microcode in ForwardCom at all (and no microcode cache, of course).
Intel and AMD use microcode only for complicated instructions that use many µops, as microcode is quite inefficient. There are thousands of different instruction codes in x86 and most of them are implemented directly in silicon. That must be a lot of silicon! I am sure that this is slowing things down. Some of these instructions were added mainly for marketing reasons, providing no real advantage. Many instructions are obsolete, but they keep supporting them for backward compatibility.


Author: Hubert Lamontagne Date: 2016-08-12 18:13
Joe Duarte wrote:
what goes into implementing new instructions? What's the physical manifestation of an instruction? Is there one, or is it typically just microcode? I'm always amazed by how more instructions are constantly added, and that they consistently speed things up.
Afaik, generally, there are '4 levels of implementation' for chips:

- FPGA: You write your logic in Verilog or VHDL, compile it to the FPGA, then run it.

- FPGA to ASIC conversion: This is basically an ASIC with a bunch of FPGA-like cells, but where the metal interconnection layer is customized. This has much lower fixed costs than an ASIC, but there's only a speed gain of about 30% and a cost gain of about 50% over FPGAs.

- ASIC: The Verilog / VHDL code is compiled to a proper circuit layout. Most chips use this. This makes much faster chips and has a much lower unit cost, but the fixed costs are very high ($1million+, need lots of units). There needs to be lots of verification.

- ASIC with additional custom-layout parts: Same as above but some parts are custom-laid out rather than automatically. AFAIK Intel does this for the most speed-critical parts of its CPUs (like the L1 Data Cache) and it is part of its advantage over AMD (I've read somewhere that AMD has stopped doing custom layout).

More info: http://electronics.stackexchange.com/qu ... -asic-made

Within the chip, instructions that have the same memory access pattern and latency are often handled more or less the same way, with the ALU computing the different possible arithmetic operations in parallel and a multiplexer at the end selecting which result to use. In Verilog, it looks something like this:

case (alu_op)
    0:  alu_out <= b;                                   // pass-through / move
    1:  alu_out <= a + b;                               // add
    2:  alu_out <= a - b;                               // subtract
    3:  alu_out <= a & b;                               // bitwise and
    4:  alu_out <= a | b;                               // bitwise or
    5:  alu_out <= a ^ b;                               // bitwise xor
    6:  alu_out <= a >> b;                              // logical shift right
    7:  alu_out <= a << b;                              // shift left
    8:  alu_out <= a >>> b;                             // arithmetic shift right
    9:  alu_out <= a >= b ? 1 : 0;                      // unsigned compare
    10: alu_out <= ($signed(a) >= $signed(b)) ? 1 : 0;  // signed compare
    default: alu_out <= alu_out;                        // unknown op: hold value
endcase

Agner wrote:
My idea is not to use microcode in ForwardCom at all (and no microcode cache, of course).
Intel and AMD use microcode only for complicated instructions that use many µops, as microcode is quite inefficient. There are thousands of different instruction codes in x86 and most of them are implemented directly in silicon. That must be a lot of silicon! I am sure that this is slowing things down. Some of these instructions were added mainly for marketing reasons, providing no real advantage. Many instructions are obsolete, but they keep supporting them for backward compatibility.
Yeah, no microcode is a nice goal to strive for. Though that requires a certain amount of "instruction set discipline": no long-latency, multi-step, or multi-µop operations; no operations that load/store an excessive number of registers or do multiple memory accesses; no special-case exceptions; and no instructions that can have all sorts of strange side effects deep in the pipeline and have to be specially executed in-order (like CPU mode manipulation through the flag registers, for instance)...


Author: grant galitz Date: 2016-08-22 22:53
Pardon my slight skimming of the contents of this thread.

I was wondering whether making the number of visible registers user-programmable, like THUMB on ARM, was considered prior to ForwardCom's writing. I could see special branch instructions upgrading or downgrading the number of registers being a useful tool for fine-tuning instruction caching/fetching versus work done per instruction. I'm also curious about the implications this would have on context-switching overhead, and any hardware logic issues.


Author: Agner Date: 2016-08-22 23:59
grant galitz wrote:
I was wondering whether making the number of visible registers user-programmable, like THUMB on ARM, was considered prior to ForwardCom's writing.
The instruction code template has 5 bits for each register operand. This gives 2^5 = 32 registers.

Adding more registers would make the instructions bigger. Fewer registers would be wasteful. If the number of registers is variable, then all the software, including function libraries and system code, would have to be recompiled every time the number of registers changes in order to make optimal use of the available registers. I think it would be too complicated to reconfigure register space as cache or something else. This would require a lot of extra wiring between otherwise unrelated parts of the chip.
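Purely as an illustration of what 5-bit register fields imply, here is a sketch of decoding three register operands from a 32-bit instruction word. The field positions below are made up for the example; the real ForwardCom templates are defined in the manual.

Code: Select all

#include <cstdint>
#include <cstdio>

// Hypothetical layout: three 5-bit register fields in the low bits of a
// 32-bit instruction word. Each 5-bit field addresses one of 2^5 = 32 registers.
struct DecodedRegs {
    unsigned rd, rs1, rs2;
};

DecodedRegs decode_regs(uint32_t insn) {
    return {
        (insn >> 0)  & 0x1F,   // destination register, 0..31
        (insn >> 5)  & 0x1F,   // first source register
        (insn >> 10) & 0x1F    // second source register
    };
}

int main() {
    DecodedRegs r = decode_regs(0x0000355Au);
    std::printf("rd=%u rs1=%u rs2=%u\n", r.rd, r.rs1, r.rs2);
}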


Author: Hubert Lamontagne Date: 2016-08-24 00:33
grant galitz wrote:
Pardon my slight skimming of the contents of this thread.

I was wondering whether making the number of visible registers user-programmable, like THUMB on ARM, was considered prior to ForwardCom's writing. I could see special branch instructions upgrading or downgrading the number of registers being a useful tool for fine-tuning instruction caching/fetching versus work done per instruction. I'm also curious about the implications this would have on context-switching overhead, and any hardware logic issues.
One reason ARM has a 16-bit instruction mode (THUMB) is that not all ARM cores have an instruction cache. If you're loading instructions from RAM or slow ROM (sometimes over a 16-bit bus), you really want instructions to be as small as possible because loading them is so costly. THUMB mode was often used on the GBA when running code from the slower RAM or ROM, which had a 16-bit bus. This is also why the SuperH has 16-bit instructions.

Having an explicit switch between ARM and THUMB mode simplifies the pipeline because it removes the need to have a prefetch buffer to handle unaligned instructions. Later on, they added THUMB-2 mode which removed this limitation and would let the program mix up 16-bit and 32-bit instructions.

The catch with small instructions is that 16-bit THUMB code runs significantly slower than ARM on modern cores that have instruction caches and large 64-bit buses, since so many operations need to be broken up into more instructions. THUMB-2 code also doesn't run faster than ARM in most applications, in spite of taking up less instruction cache. One comparison of Linux kernel sizes and CoreMark results gives:

vmlinux-ARM32: text 8.4mb, data 463kb, bss 827kb, total 9.7mb
vmlinux-thumb2: text 6.7mb, data 463kb, bss 827kb, total 8.0mb

The best is -O3 -funroll-loops -marm -march=armv5te -mtune=cortex-a8
The best armv7-a is -O3 -funroll-loops -marm -march=armv7-a -mtune=cortex-a8 at 95.2 % of overall best
The best Thumb-2 is -O3 -funroll-loops -mthumb -march=armv7-a -mtune=cortex-a8 at 88.7% of overall best
The best Thumb-1 is -O2 -mthumb -march=armv5te -mtune=cortex-a8 at 64.4% of overall best

(Source)

So being able to lower the number of registers below 32 to shrink opcodes on a CPU design that's targeting large machines with lots of RAM doesn't seem too relevant to me...

Now, increasing the number of registers is another question and I'm curious about what people have to say about that.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

ARM with scalable vector extensions

Post by agner » Thu Nov 02, 2017 12:44 pm

Author: Agner Date: 2016-08-23 00:24
Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions http://www.anandtech.com/show/10586/arm ... ata-center

The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:
Technology Update: The Scalable Vector Extension (SVE) for the ARMv8-A architecture https://community.arm.com/groups/proces ... chitecture

This new ARM extension has the same idea as ForwardCom that the same code should be able to run on different processors with different vector length. But the hardware seems much more complicated than ForwardCom. The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones. The compiler would also be quite complicated, I think. The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.


Author: Jorcy Neto Date: 2016-08-23 06:11
Agner wrote:
But the hardware seems much more complicated than ForwardCom. The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones. The compiler would also be quite complicated, I think. The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.
Whereas splitting wider vector/SIMD instructions into narrower execution units is quite straightforward (just like early SSE/SSE2 in PIII/P4), merging narrower vectors for wider units really seems way more complex.


Author: Hubert Lamontagne Date: 2016-08-26 11:21
Agner wrote:
Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions.


The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:
Technology Update: The Scalable Vector Extension (SVE) for the ARMv8-A architecture.


This new ARM extension has the same idea as ForwardCom that the same code should be able to run on different processors with different vector length. But the hardware seems much more complicated than ForwardCom.
Well, the ARM version is also somewhat more aggressive since it tries to deal with scatter-gather, loop-carried dependencies (aka feedback) and data-dependent loop exits.

Agner wrote:
The ARM extension requires that the hardware scheduler is able to split long vectors into shorter ones or merge short vectors into longer ones.
Hmm, I'm re-reading the article, and now I'm really wondering what they actually mean about how it adapts to different vector sizes. Either:
- It's a software scheme (in which the software knows the vector size, and if you have, say, 512 bit vectors, then the top 1536 bits always read as 0) and the article is slightly wrong in the way it presents things.
- Or, vectors can always be 2048 bits and it simply uses more micro-ops on narrower processors (this is already how NEON works) - this is relatively straightforward to implement, although it leaves the issue of having to deal with oversized register files (I'm not up to date on how costly this is in hardware).
- Or maybe it's some other crazier scheme? I can't quite tell.

Agner wrote:
The compiler would also be quite complicated, I think.
Auto-vectorization is always complicated!

I'm not sure exactly why the ARM version is more complicated than ForwardCom though. Yes, the ARM version requires a more complex calculation to get the vector length and turn it into lane flags, although there's a good chance this extra calculation can run in parallel with other stuff and essentially be free. The ARM version doesn't require transforming the loop index 0..n-1 into a negative index from the end of the buffer, which I'm sure is doable in SSA form but might get complex for loops with lots of references to the indexing variable.

Agner wrote:
The special ForwardCom loop type is much more elegant and efficient with no limit to vector length, see forwardcom.info.
It's true that the ARM version doesn't have the insight that if you use reverse indexing from the end of the buffer, you can clamp the index and use it as a count for the number of items per iteration. If I'm reading the news post correctly, I think the ARM idea is that they use the lane predication flags as vector sizes, so they probably need a couple of specialized instructions to produce the correct clamped item counts and masks. But they're doing both predication and variable vector size using the same hardware, which I guess is a good thing.
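To make the reverse-indexing idea concrete, here is a scalar C++ sketch of that loop pattern. MAX_LANES is just a stand-in for the hardware's maximum vector length, which real ForwardCom code would not need to know; the inner loop stands in for one vector operation.

Code: Select all

#include <algorithm>
#include <cstddef>

constexpr std::ptrdiff_t MAX_LANES = 8;   // assumed hardware vector length

void scale(float* buf, std::ptrdiff_t n, float factor) {
    float* end = buf + n;                 // the loop works relative to the end
    for (std::ptrdiff_t remaining = n; remaining > 0; ) {
        // clamp the remaining count and use it as the element count
        std::ptrdiff_t lanes = std::min(remaining, MAX_LANES);
        float* block = end - remaining;   // negative offset from the end
        for (std::ptrdiff_t i = 0; i < lanes; i++)   // one "vector" operation
            block[i] *= factor;
        remaining -= lanes;
    }
}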

Maybe they're using general-purpose integer registers as lane masks - that would explain the 2048-bit size limit (64-bit register = 64 lane flags × 32-bit float per lane = 2048 bits). The load/store unit needs to know the lane masks, so they need to be known early in the pipeline, and I don't think it's using too many general-purpose register file ports.


Author: Jorcy Neto Date: 2016-12-20 07:22
Agner wrote:
Two people have sent me this link. Yesterday, ARM announced a future extension with variable vector length:
ARM Announces ARM v8-A with Scalable Vector Extensions.

The details are not available yet, but it appears that the vector length must be a multiple of 128 bits. Maybe it must be a power of 2:
Most of the Hot Chips presentations are made publicly available late each year. You can now find further details on SVE at:
http://www.hotchips.org/wp-content/uplo ... 51-v11.pdf

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for instruction set - now on Github

Post by agner » Thu Nov 02, 2017 12:49 pm

Author: Hubert Lamontagne Date: 2016-09-05 01:53
One recurrent question: How will ForwardCom run Linux's mmap function?


Author: Agner Date: 2016-09-05 05:12
You can map a virtual address space as unused, read only, write only, or read/write. In this way you can generate traps on first read and first write to a block of memory. If the mapped file is bigger than available RAM and you are writing in a random fashion then it will be more likely to flush a dirty part of the area to the file than to fragment the memory. The number of memory mapped files you can have is obviously limited by the size of the memory map.

I wouldn't recommend the use of memory mapped files with ForwardCom. Now that solid state disks are becoming popular (and bigger) there is hardly any point in using memory mapping for the purpose of minimizing disk head movements.


Author: Hubert Lamontagne Date: 2016-09-05 18:33
There's more to mmap:

- It can be used to allocate memory, page by page (for instance as a back-end for malloc(); see the sketch after this list).
- It can be used to memory-map hardware devices. For instance, sometimes video device VRAM is mapped by mmap()ing /dev/mem at 0xa0000.
- It can be used for inter-process shared memory: if multiple programs mmap() the same file, they will actually get a mapping to the same block of RAM.
- The fork() function relies on paging to do a shallow copy of the whole memory space, and then selectively does a real deep copy of a page when one of the forks writes to it and triggers a page fault.
- Demand paging of newly allocated memory lets the OS deal with badly behaved memory allocators, for instance when the Java virtual machine allocates 512 MB of RAM on startup. Executable data of large shared libraries like Qt can also be loaded on-demand this way.
- When RAM runs out, mmap()ed memory that is backed by disk files and doesn't need to be written to swap space can be prioritized.
- mmap() also works well with SSDs.
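As a rough sketch of the first and third items above (POSIX calls, with error handling reduced to a minimum; the file name is just an example):

Code: Select all

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// 1) Anonymous mapping: the typical back-end for a user-space allocator.
void* alloc_pages(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

// 2) Shared mapping of a file: every process that maps the same file with
//    MAP_SHARED sees the same physical pages.
void* map_shared(const char* path, std::size_t bytes) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        // the mapping stays valid after close()
    return (p == MAP_FAILED) ? nullptr : p;
}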


Author: Agner Date: 2016-09-06 08:51
Hubert Lamontagne wrote:
There's more to mmap
I was not aware of these details.

Many of the applications of mmap you mention look like hacks in my opinion. A redesign of the memory management system should offer genuine solutions to these problems rather than hacks.

Malloc should be implemented in a way that matches the new memory management system.

Video ram should be accessed only through a device driver. This device driver may share video memory with an application only if the OS allows an application to monopolize the screen.

Sharing of memory between processes is described in the ForwardCom manual. It should be done in a standardized way for security reasons.

Demand paging and memory swapping to disk are separate problems that should have dedicated solutions without using mmap.


Author: Bigos Date: 2016-09-06 16:19
I don't think you will be able to remove all usages of mmap(). Memory mapping is very convenient, and an architecture that doesn't allow for it won't be used much for general-purpose computing. Programmers are already familiar with it and have created a lot of utilities that use it. For example, you couldn't port most general-purpose OSes to such an architecture, since they are based on virtual memory (there is a special fork of Linux called uClinux [1] that doesn't need an MMU, though it's specialized for embedded solutions).

For example, 3D graphics APIs have memory-buffer map functions that are used to map part of GPU memory into the process address space. Since this part is already exclusive to the application (it has been allocated previously), there is no harm in bypassing the device driver (which performs the map), and it lets the user perform reads and writes with higher performance. A solution with a shadow buffer that is copied around when needed is possible, but would reduce performance.

The Mill CPU family uses protection data that describes which memory is accessible to which application [2]. It's stored in a PLB (protection look-aside buffer) that, unlike a TLB, is not on the critical path, since it only has to prevent failing writes from affecting other CPUs and make failing reads cause a fault - no address translation is performed. Maybe such a mechanism would be a good compromise here? The OS (or any other entity) would be able to let another entity access its memory on demand.

Mill also has a memory translation table, but the TLB sits outside of the core, near a memory controller. The mapping is global, not per-process (no TLB flush on context switch needed), and the translation is done only when an access goes out to memory, at which point the latency is already high, so making the TLB fast is not that important. All the data in the caches is virtually addressed and virtually tagged.

But I repeat myself here. We could come up with a better solution instead, though it doesn't appear that it will be simple.

Author: Hubert Lamontagne Date: 2016-09-06 22:41
OK, I checked: there are 2 OS calls to allocate RAM in Linux, sbrk() and mmap(). They serve as a back-end to malloc(). The "heap" segment shown on some Linux memory map diagrams is the section allocated by sbrk(), and the rest of user-allocated RAM is allocated using mmap().

The current default implementation of malloc() is ptmalloc2 and it uses both sbrk() and mmap(), but in different scenarios (a simplified sketch follows this list):
- Small to medium allocations on the main thread use sbrk(), because it doesn't matter too much if some blocks get free()'d later on since the allocator can still reuse the leftover holes for allocations of the same size.
- Large allocations on the main thread use mmap(), because sbrk() can only reclaim free()'d memory in the exact reverse order that it was allocated, so that a large allocation can easily become impossible to reclaim if it's followed by small longer-lived allocations.
- Allocations on other threads use mmap() instead of sbrk(), because only mmap() is thread-safe. Also, using sbrk() would create a series of segments on the heap that are used by different threads, which would make it hard to reclaim memory until all the threads have finished running.
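A grossly simplified model of that policy is sketched below. The 128 KB cutoff mirrors glibc's default M_MMAP_THRESHOLD, but the code is only an illustration; a real allocator keeps free lists and reuses the sbrk()-grown heap instead of leaking it like this toy does.

Code: Select all

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

constexpr std::size_t MMAP_THRESHOLD = 128 * 1024;   // illustrative cutoff

void* toy_malloc(std::size_t bytes) {
    if (bytes >= MMAP_THRESHOLD) {
        // Large request: its own mapping, returnable to the OS independently.
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }
    // Small request: grow the program break; reclaimable only in reverse order.
    void* p = sbrk(static_cast<intptr_t>(bytes));
    return (p == reinterpret_cast<void*>(-1)) ? nullptr : p;
}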


Author: Agner Date: 2016-09-07 00:10
Hubert Lamontagne wrote:
Ok I checked, there are 2 OS calls to allocate RAM in Linux: sbrk() and mmap().
Thanks for the clarification. sbrk would be easy to implement in ForwardCom. And it could be thread-safe, because in ForwardCom each thread has its own memory map and its own private memory. Everything that causes heavy fragmentation of memory for the same thread is a no-go. In some applications, the limitation can be circumvented by making multiple threads - one for each block of shared memory.


Author: Hubert Lamontagne Date: 2016-09-07 17:18
Agner wrote:
And it could be thread safe because each thread has its own memory map and its own private memory in ForwardCom.
Ah, but if your threads don't share memory, then they're not threads, they're separate processes, which is a different thing!


Author: Agner Date: 2016-09-08 00:39
Hubert Lamontagne wrote:
if your threads don't share memory, then they're not threads, they're separate processes, which is a different thing!
ForwardCom has a special security feature for threads. A process can have a parent thread and several child threads. All child threads have access to the memory of the parent thread. The stack, heap, and thread-local storage of a child thread are private by default, not accessible to its parent and siblings. This can be useful if the process is servicing several clients. Communication between threads - e.g. a semaphore - must use the parent's memory space.

Thread memory is shared only if this is necessary for compatibility with legacy software.


Author: Hubert Lamontagne Date: 2016-09-08 17:24
Agner wrote:
ForwardCom has a special security feature for threads. A process can have a parent thread and several child threads. All child threads have access to the memory of the parent thread. The stack, heap, and thread-local storage of a child thread are private by default, not accessible to its parent and siblings. This can be useful if the process is servicing several clients. Communication between threads - e.g. a semaphore - must use the parent's memory space.

Thread memory is shared only if this is necessary for compatibility with legacy software.
If you want to port anything that uses threads on a POSIX OS (which all use the pthreads library) or on Windows, the expected behavior is that all threads share the same memory, yes. For instance, if multiple threads can modify a std::vector (protected by a mutex), then if the vector changes size and its memory is reallocated, all threads have to see the newly allocated memory regardless of which thread caused the size change. Since std::vector memory allocations ultimately use malloc(), this means that malloc() has to return memory which is visible to all threads by default.
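Here is a minimal version of that scenario: if push_back() triggers a reallocation in one thread, the other thread must see the new storage, so the allocator behind it has to hand out process-wide memory.

Code: Select all

#include <mutex>
#include <thread>
#include <vector>

std::vector<int> shared_data;   // grown by both threads
std::mutex shared_mutex;

void producer(int base) {
    for (int i = 0; i < 1000; i++) {
        std::lock_guard<std::mutex> lock(shared_mutex);
        shared_data.push_back(base + i);   // may reallocate the buffer
    }
}

int main() {
    std::thread t1(producer, 0), t2(producer, 1000);
    t1.join();
    t2.join();
    return shared_data.size() == 2000 ? 0 : 1;
}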


Author: Commenter Date: 2016-09-07 19:28
I don't consider myself well informed on this topic, so please excuse me for any mistakes.
Agner wrote:
there is hardly any point in using memory mapping for the purpose of minimizing disk head movements
I've actually never heard of mmap being useful for that purpose? mmap is useful for reading files as it avoids an extra memory copy. When a file is mapped, the userland application can essentially read directly from the kernel's disk cache buffers. With standard read() calls, on the other hand, the kernel has to make an additional copy into the userland-supplied buffer. This works similarly for write() calls.
As I/O gets faster, reducing this kind of overhead becomes more important.
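As a sketch of that difference (POSIX calls, minimal error handling): reading a whole file through a mapping touches the page-cache pages directly, whereas a read() loop would copy each chunk into a user buffer first.

Code: Select all

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Checksum a file through a read-only mapping: no extra user-space copy.
unsigned long sum_with_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    fstat(fd, &st);
    void* m = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (m == MAP_FAILED) return 0;
    const unsigned char* p = static_cast<const unsigned char*>(m);
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++) sum += p[i];  // reads page cache directly
    munmap(m, st.st_size);
    return sum;
}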

In fact, I'd imagine that with faster NVRAM style devices, mmap may allow the file to be read directly off the I/O device without ever being read into system RAM.


Author: Bigos Date: 2016-09-08 13:04
Commenter wrote:
In fact, I'd imagine that with faster NVRAM style devices, mmap may allow the file to be read directly off the I/O device without ever being read into system RAM.
Actually, this is what the Linux DAX infrastructure does. It allows you to map NVRAM directly into the process's address space.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Paging

Post by agner » Thu Nov 02, 2017 12:56 pm

Author: Kurt Baumgardner Date: 2016-09-09 23:40
I have received this mail from Kurt Baumgardner and I think it belongs on the forum so that others can join the discussion:

Hi Agner,

My name is Kurt Baumgardner, and I've been reading your processor design draft. Very impressive! Quite obviously a labor of love. You've done your research, which is also obvious from the rest of your site. I highly respect your work and findings.

Please understand, I don't mean to be critical. But I had to mention my worry about the decision to drop paging from the memory model, which I think is going to be missed. In a handful of 'controversial' choices, you provided a logical rationale for going with one method vs. another, but, either I missed it, or you didn't go into detail about why no paging. I think it is related to your application memory map model, which, if I read correctly, is a simpler (and possibly tighter) app-based, "just what's needed" model. Such a model runs counter to a typical paging scheme, where most mallocs receive pages vs. arbitrary-sized blocks. I will be the first to admit that I must read that section a few more times. And, even then, it's a deep subject to follow.

Regardless, I feel that, from what I understand so far, the memory model might be a bit too simple. You described an idea of the compiler calculating the needed stack size, and/or profiling to attempt to determine the needs. And, this is made easier with the 2-stack approach (which I really like). I understand this pre-calculation approach for optimization, but is its purpose also to help reduce complexities that would make your memory model seem less attractive? (I'm not sure if I described that understandably).

Let me take a step back. My biggest issue/fear/disappointment is with the removal of paging. It's always nice to fit everything in memory (64 MB is enough for anyone, right?), but you can't always do it. With all the ideas of future compatibility, like with software automatically taking advantage of larger vectors, the loss of paging seems like a huge step backwards, as it seriously limits machines without large amounts of installed memory. I know of a small percentage of server apps that allocate memory with disregard for actual physical memory size, depending on and expecting the paging hardware and software to handle the otherwise messy details of managing extremely large datasets.

In my opinion, 2 of the most important inventions of computing are: #1: using free memory as a disk cache, and #2: built-in automatic paging. There's really no other practical way to provide apps the freedom to allocate memory as needed. Apps have no business trying to home bake their own paging scheme.

Without automatic paging, all the work you did for future compatibility is wasted. Why? Because developers will not write the big programs that could take advantage of future improvements, because they're afraid of running out of memory!

I know the goal is to simplify the system, for elegance, for cost, for power efficiency, for performance, and many other good reasons. But, your future chip must at least be able to support the hugely-successful and useful capabilities of today's processors. The loss of paging cripples the system severely as a whole.

Or, maybe I'm missing something huge. With the current spec, would you simply have to crash a launching (or running) program with an "Out-of-memory" error? Memory is handled so well currently that many programmers don't even test for the possibility of memory allocation to fail (which is awful, but also amazing that they can get away with it).

Thanks for your time reading, and, if you have a minute or two, please let me know what you think, or your rationales, or your thoughts. If you wish, this text could be placed into your forum. Have a nice day!

By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).
2a. Microcode programming (if so equipped), or
2b. the ability to control load/store, ALU, bit shift, cache control, vectoring, etc hardware.
3. An on-chip cache dedicated to user instructions. Could be used for extremely fast lookups/translations, pre-calculated math, etc.

(one can dream!)

Kurt


Author: Agner Date: 2016-09-10 00:18
Kurt Baumgardner wrote:
my worry about the decision to drop paging from the memory model,
Current systems have very complicated memory paging with huge two-level page tables that are partially cached in the on-chip TLB. This system is complicated and uses a lot of resources.

My goal with ForwardCom is to see what can be gained by a complete redesign. The paging system seemed an obvious candidate for simplification so I got the idea of replacing the huge number of fixed-size pages with a small number of variable-size pages. It was just an idea that I wanted to discuss, and as you can see it has already generated a lot of discussion.

This memory map system will only work if we can avoid heavy memory fragmentation, but if it works it will be a great advantage. The memory map will be so small that it can be contained entirely on the chip. You can have a separate memory map for each thread and you can easily switch memory map on system calls and task switches.

An application can still allocate memory dynamically, but preferably in a way that doesn't cause too much fragmentation. If an application allocates a huge amount of memory, not knowing how much it will actually need, then the OS may just allocate virtual memory in the first place. When physical memory is needed, it will be allocated in blocks of exponentially growing sizes (each new block will be bigger than the previous one) in order to minimize memory fragmentation. As a last resort, the system may move data around in memory if a contiguous virtual memory space has become mapped to a too fragmented physical memory. Moving data in memory may be fast because the data path to cache (and to main memory?) can be as wide as the maximum vector length.
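As a toy model of that growth policy (the 1 MB starting size and the doubling factor are just illustrative choices, not part of the proposal):

Code: Select all

#include <cstddef>
#include <vector>

// Each time an application outgrows its committed memory, the next block it
// gets is larger than the previous one, so the number of separately mapped
// blocks (and hence memory-map entries) grows only logarithmically.
struct ToyProcessMemory {
    std::vector<std::size_t> blocks;    // sizes of blocks handed out so far
    std::size_t committed = 0;
    std::size_t next_block = 1 << 20;   // first block: 1 MB (illustrative)

    void ensure(std::size_t needed) {
        while (committed < needed) {
            blocks.push_back(next_block);
            committed += next_block;
            next_block *= 2;            // each new block doubles in size
        }
    }
};
// Growing to 1 GB this way needs only about 10 blocks,
// versus roughly 256k fixed 4 KB pages.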

The advantage of having variable size memory map entries is that you will need much fewer entries. The disadvantage is that you need a lot of comparators to check all entries in one or a few clock cycles. The actual number of entries we will have depends on the hardware implementation and this is subject to experimentation. A well behaved application that causes little or no memory fragmentation will need very few memory map entries. So a programmer who makes a performance-critical application should exercise a little discipline to avoid excessive memory fragmentation.


Author: Hubert Lamontagne Date: 2016-09-11 13:40
I guess the system is somewhat close to software TLB schemes like on the MIPS. The difference is, of course, that in software TLB schemes, pages must have 2^N sizes and be aligned. The advantage is that you don't need a full comparator per TLB entry and a content-addressable memory is sufficient, which means a larger TLB is possible - MIPS classically had a 64-entry TLB.
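In software terms, the difference between the two lookup styles looks roughly like the sketch below (neither struct reflects an actual hardware entry format): with fixed, aligned 2^N pages the match is a single equality test on the virtual page number, while variable-sized entries each need a base/limit range comparison.

Code: Select all

#include <cstdint>
#include <optional>

struct PageEntry  { uint64_t vpn;  uint64_t pfn; };           // fixed 4 KB pages
struct RangeEntry { uint64_t base; uint64_t limit; int64_t offset; };

std::optional<uint64_t> lookup_paged(const PageEntry* tlb, int n, uint64_t va) {
    uint64_t vpn = va >> 12;                      // virtual page number
    for (int i = 0; i < n; i++)                   // hardware: CAM, all entries in parallel
        if (tlb[i].vpn == vpn)
            return (tlb[i].pfn << 12) | (va & 0xFFF);
    return std::nullopt;                          // TLB miss
}

std::optional<uint64_t> lookup_ranges(const RangeEntry* map, int n, uint64_t va) {
    for (int i = 0; i < n; i++)                   // hardware: two comparators per entry
        if (va >= map[i].base && va < map[i].limit)
            return va + map[i].offset;
    return std::nullopt;                          // no matching entry -> fault
}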

Variable page size is not that well established as a technique: large pages don't work that well with copy-on-write techniques, they don't deal well with fragmentation of memory, and larger pages tend to be split up by operations like mmap / munmap / mprotect... Generally, memory contiguity is a limited resource, with contention not only at program scale but also at global OS scale, and the general response seems to have been to keep using 4 kB pages and increase the size of TLB caches.

When you're adding multiple page sizes, you're introducing some hard NP-complete problems to the OS's page allocator, since it can't just have a list of available free pages that it distributes to programs willy-nilly, and you have to have all sorts of page-contiguity preserving algorithms. And I'm pretty sure that it's often not practical to allocate memory by enlarging already allocated pages because this only works if the contiguous memory is unallocated, which becomes not only a program-scale problem but an OS-scale problem.

Moving data in memory is NOT a good idea IMHO, no matter how fast you can copy. Especially for applications like games, which can use hundreds of MB of RAM and CANNOT tolerate 5 ms and 10 ms stalls.


Author: Kurt Baumgardner Date: 2016-09-13 00:12
I don't think it's possible to both be thrifty with memory, and prevent memory fragmentation, from an application programmer's point of view. You can just allocate a huge heap at the start of your program, and do all your work in that heap, thereby preventing fragmentation. Or you can just use what's currently needed, releasing memory back to the OS, but creating the potential for fragmentation. As a programmer, personally I don't want to have to think about how the OS gives me my memory, and I want the OS to make use of memory I've released. And, in today's OSes, the method for providing memory to programs is a magic black box - you ask, and the OS gives. You really don't have a lot of control of how that occurs.

It would be nice to be able to provide hints: "I need a temporary chunk of memory that I'll give back soon", or, "I'll be needing a contiguous chunk of X size for a while."

I assume we're talking about a replacement for x86/x64, for servers and desktops. I can sympathize with the goal of optimizing, and wanting to keep everything simple, and that's a noble cause. But, before making it optimal, it has to work first (and by that, I mean it has to be practical, and be able to handle today's apps.)

Consider a web browser. Navigate to a text page (or a forum), and that browser may need a couple of MB of memory. But, then navigate to an image-laden page, and BAM! You now need dozens of MB to render all those images into. Some browsers delay the rendering of JPGs, for example, until you actually scroll down to where the JPG is visible. And if you keep scrolling, they toss that rendering from memory.

My point is that, some programs have very complicated memory allocation needs, and that does not suggest that they are written poorly. Those web browser designers are also optimizing, but they are concerned with memory usage, load time, render speed, and much more. Forcing them to allocate memory a different way limits their abilities. And, without paging, they are forced to be thrifty with memory. Which causes fragmentation. Full circle.

If this chip is to sit in people's homes across the world, you've got to expect that many thousands of different programs will be running on them. And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe. And, even trying to be respectful of the machine's resources seems difficult, as I am torn between releasing no-longer-needed memory (which is supposed to be the right thing to do), and trying to prevent fragmentation (which is, essentially, the opposite, unless I've missed something).

Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice. Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design. But, maybe one has merit, so I present them for thought. I find the whole project fascinating, and I hope one day that you can get the chip to the fabrication stage! Best of luck!


Author: Agner Date: 2016-09-13 00:39
Kurt Baumgardner wrote:
And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe.
Some performance-critical programs are written by highly skilled programmers who want to tweak as much performance out of the system as possible. Depending on the application, they may be able to predict how much memory they will need and allocate it all at once. Such applications would only need a handful of memory blocks which could easily be handled by an on-chip memory map. Other programs, probably less critical, are made with point-and-click tools and abstract frameworks by people who have no idea what is going on behind the scenes. They may cause heavy memory fragmentation without even knowing it. Maybe we can have a dual system with an on-chip memory map for well-behaved applications and a partially software-based paging system which will be activated only if the memory becomes too fragmented. Programmers who make well-behaved applications will be rewarded with superior performance.
Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice. Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design.
These ideas are not far-fetched, and some have already been implemented in various systems. Swapping register banks was even supported in some processors back in the 1980s. Putting RAM on the chip is an obvious thing to do for the less memory-hungry applications. The more RAM you put on the chip the slower it will be, so you need one or more levels of cache in between anyway.


Author: Kurt Baumgardner Date: 2016-09-13 19:08
Agner wrote:
Kurt Baumgardner wrote:
And each program will be built by people with varying levels of skill. People who will not respect the beauty behind the memory organization scheme. And, you have to cater to them, because if their program brings the system to its knees, your chip and the OS will be blamed.

In other words, some crappy app should be able to disrespect the entire memory space...and your chip and OS should handle it with some grace, I believe.
Some performance-critical programs are written by highly skilled programmers who want to tweak as much performance out of the system as possible. Depending on the application, they may be able to predict how much memory they will need and allocate it all at once. Such applications would only need a handful of memory blocks which could easily be handled by an on-chip memory map. Other programs, probably less critical, are made with point-and-click tools and abstract frameworks by people who have no idea what is going on behind the scenes. They may cause heavy memory fragmentation without even knowing it. Maybe we can have a dual system with an on-chip memory map for well-behaved applications and a partially software-based paging system which will be activated only if the memory becomes too fragmented. Programmers who make well-behaved applications will be rewarded with superior performance.
Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting. I appreciate a system that remembers to give the programmer the keys to the kingdom, so to speak. 'Cause most systems provide a dumbed-down, common-denominator approach. But a dual system? That makes the system that much more complicated, doesn't it? Maybe OS drivers and OS-protected low-level stuff could use the on-chip stuff exclusively. Otherwise, I'll be pushing to have my Super Zombie Crusher 3 pushing OS stuff out to let my 100-fps renderer live on-chip, right?!! What programmer would choose the slow model? I don't know - the more I think about it, the more I lean towards the traditional model. Yes, it's slow and complex. But, that's because it's necessary, to get the power you want and need. I think Intel got it right, somewhat...it's hard to deny. Simplifying it upfront leads to more complex schemes down the road. To me it seems inevitable. Now, please reconsider on-chip mass memory, which could alleviate this issue and others - see below.
Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice. Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.

2. If not #1, then some really good cache hint/directive instructions.

3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.

4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.

5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.

I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design.
These ideas are not farfetched, and some have already been implemented in various systems. Swapping register banks was even supported in some processors back in the 1980s. Putting RAM on the chip is an obvious thing to do for the less memory-hungry applications. The more RAM you put on the chip the slower it will be, so you need one or more levels of cache in between anyway.
On the subject of massive on-chip memory: It's slow. Ok. But, how slow? I really want to see the numbers on this one. Maybe it requires a new technology. But, let's assume it's possible. Think about it. Low fixed wait states. 0 cache misses. No complicated cache hardware - this is a big one. Or that pipeline-burst cache hardware could instead force-feed the execution pipeline with instructions. Cache misses happen a lot. And, surely it's going to be faster than external memory. Is it expensive, money-wise? Is that why we don't see it being attempted? Maybe you could spread the memory across multiple cores.

Some algorithms simply cannot avoid reading memory "vertically", such as 90-degree block rotation algorithms. These functions can suffer a cache miss on each read, unless proper cache hints are provided...yielding a needlessly slow read/write.

To me, cache misses and branch misprediction are the 2 things that prevent us from being able to optimize our code accurately, because we cannot determine how much time each instruction will take, and therefore we cannot pair instructions perfectly. Knowing exact memory timing allows the pipeline to be finely tuned, I'd imagine. Removing the cache hardware provides fast chip real estate to put RAM in, and simplifies the design enough to justify research into making it possible.

And, your memory map could sit in this same main memory too.


Author: Hubert Lamontagne Date: 2016-09-13 12:12
Kurt Baumgardner wrote:
Back to the subject of dreaming about my ideal CPU: These are not well thought out, or even practical, most likely. But they'd be nice. Here goes:

1. 16Gb memory on the same chip as the CPU. I don't know what the limitations are, but, if all your PC's memory could be super-fast, on-chip, wow! Memory wait states slow down the CPU a lot, and, if it was all on the same chip, you could eliminate all the complex nightmare caching hardware.
http://www.zdnet.com/article/why-doesnt ... n-the-cpu/
http://electronics.stackexchange.com/qu ... e-cpu-chip
2. If not #1, then some really good cache hint/directive instructions.
Generally, this is done with instructions like PREFETCH on x86.
3. Instead of relying on branch prediction, why not take both branches, and provide the ability to swap pipelines to use the confirmed branch? This dual pipeline could be used for extra execution when branching was not occurring.
A lot of branch instructions that your CPU pipeline will see are in loops and generally branch the same way a lot of times in a row, and would actually get slower if you ran both sides, because rarely executed paths would be competing for resources. Here's a random C++ example:

Code: Select all

for(int i=0; i<200; i++) {
    float sample = buffer[i];
    mFilterMem += (sample - mFilterMem) * mFilterParam;
    sample = mFilterMem;
    if(mVolumeFade) {
        sample *= mVolume;
        mVolumeFade += mFadeRate;
    }
    buffer[i] = sample;
}
The branch at the end of the loop (i<200) will be taken 199 times in a row, and only skipped once. Speculatively executing the code after the loop makes no sense in this case. In addition to this, the branch in the middle of the loop (if(mVolumeFade)) will either always be taken or never be taken in the 200 iterations, so in that case speculatively executing both sides of the loop doesn't make sense either, and it's better to trust your branch predictor.

Speculatively executing both sides is generally something you'll only see in very large and complex out-of-order CPU cores, and I think it probably involves something like a 3- or 4-way branch predictor that predicts NO-JUMP/MAYBE-JUMP/YES-JUMP or NO-JUMP/MAYBE-NO-JUMP/MAYBE-YES-JUMP/YES-JUMP. I don't think it handles multi-way branch prediction too well either (i.e. a jump to a function pointer... some modern cores can handle multiple jump targets like this).
4. A hardware block move/fill/swap, page-based. Runs like a background job. An instruction could be used to test for completion by comparing a given address against each pending job's address range.
Sounds a lot like a DMA controller - and if I'm not mistaken, the x86 DMA controller can be used for that, but I don't think it runs fast enough to get a speed gain. And it kinda competes with the CPU for memory access cycles, which isn't too good, although I guess you could make it operate in the unused cycles.
5. Instead of saving registers on task switch, use an array of registers, indexed by taskid.
I'm sure I could think up some more. As I stated before, I realize that most of these are far-fetched, and do not really fit into Mr. Fog's design. But, maybe one has merit, so I present them for thought. I find the whole project fascinating, and I hope one day that you can get the chip to the fabrication stage! Best of luck!

One that's worth it and that is used on ARM is to have 2 stack pointer registers, one for USER mode and one for OS mode, and switch when interrupts happen, because this makes interrupt handlers faster. Fast interrupt handlers are generally nice to have because then it makes stuff like software TLBs realistic, and it speeds up OS calls. Fast task switch between user programs is a bit less important because you're going to have tons of cache misses anyways, so not having to save/restore registers isn't as important (and a lot of programs can run at the same time, which makes it unlikely that you can hold all register sets at the same time).


Author: Kurt Baumgardner Date: 2016-09-14 16:56
Thanks, guys - great, informative, interesting answers. I wish I knew more about the architecture. I would like to come up with a way to mitigate the costs of a pipeline flush. It seems like such a waste when it happens. As good as modern branch prediction is, a lot of the benefit of smart prediction is thrown away. Maybe, while taking both branches, another predictor would realize that this mispredicted branch was being repeatedly taken, and would instead store an image of the pipeline that could be switched to when the misprediction actually occurred, with the other context being flushed in the background.

I think I would pull my hair out trying to design such a thing. @Agner - I think you're on the right track avoiding microcode, keeping the design simple, and seeing just how well it can be made to perform. Best of luck.


Author: Hubert Lamontagne Date: 2016-09-11 17:54
Kurt Baumgardner wrote:
By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).
That sounds a lot like a CPU+FPGA combo... That would be a nice thing to have, I'd say the challenge would be that it's hard to share an FPGA between multiple programs. The usual strategy is to save/restore the state on task switch, but that's hard with a FPGA, since there's a lot of setup data, and it's hard to save/restore all the registers and block RAMs. You could try to share parts of the FPGA but that sounds very hard to do... though I guess you could split it in N units with a bunch of gates each, and each application would take up some number of units. Another model would be how video cards are shared.
2a. Microcode programming (if so equipped), or
That's a bit hard if most of the operations on your CPU are already a single micro-op each, like on a RISC. For a machine with combined memory+arithmetic instructions like x86 or ForwardCom, I guess you could make the combinations of memory-op + arithmetic-op user-configurable, or offer configurable multi-step ops like multiply-accumulate. You could also separate the address calculation and the load/store operation, as long as you don't add new register file writes/reads on the intermediate steps - you could have hidden extra registers that are only visible to sub-operations of your instruction, which wouldn't have to be renamed/loaded/stored to the main register file.
2b. the ability to control load/store, ALU, bit shift, cache control, vectoring, etc hardware.
VLIW cpus do something a lot like that. You basically specify which operation each part of the CPU is doing. They haven't really caught on yet because they have trouble dealing with things like memory store instructions needing to have the target address known early (so that program order in loads/stores can be preserved without requiring the compiler to do ultra-aggressive alias analysis) but the data to be written has to be known late (so that the CPU has time to calculate it).
3. An on-chip cache dedicated to user instructions. Could be used for extremely fast lookups/translations, pre-calculated math, etc.

(one can dream!)

Kurt
Hmmm... that might not be a bad idea, though it would make the issue logic of the CPU appreciably more complex (so it would make branch mispredictions more costly). It would be useful to implement CISC instructions on a RISC base unit, for instance... basically it would make a good cache for implementing the 2a proposition.


Author: Kurt Baumgardner Date: 2016-09-13 15:51
Hubert Lamontagne wrote:
Kurt Baumgardner wrote:
By the way, I *really* like the idea of programmer-built extensions to the instruction set. I would love to have the following capabilities:

1. Software gate connection (direct gate array access).
That sounds a lot like a CPU+FPGA combo... That would be a nice thing to have, I'd say the challenge would be that it's hard to share an FPGA between multiple programs. The usual strategy is to save/restore the state on task switch, but that's hard with a FPGA, since there's a lot of setup data, and it's hard to save/restore all the registers and block RAMs. You could try to share parts of the FPGA but that sounds very hard to do... though I guess you could split it in N units with a bunch of gates each, and each application would take up some number of units. Another model would be how video cards are shared.
I did not consider multiple programs wanting to use the FPGA - that *is* an ugly issue. Maybe it could be treated like a device, and programs would have to make a reservation, and obtain an exclusive lock for, at least, a portion of the array. I'm not sure there's a good answer there. But, you're onto something: It could be similar to the extra processing offered by GPUs. The reason it works as well as it does for GPUs is that, typically, only one program is using the GPU at any given time (when rendering games, which is, of course, the most visible use).


Author: Agner Date: 2016-09-14 00:45
Kurt Baumgardner wrote:
Believe me, I like the idea of the super-fast memory map, and the dual system idea is interesting.[...] What programmer would choose the slow model?
The OS will choose for you. If your application is well behaved and avoids memory fragmentation then you will use the on-chip memory map. If you allocate and deallocate huge amounts of memory randomly so that the memory becomes fragmented, then the software-based memory paging will be activated when the on-chip memory map is filled up. If the user opens and closes a lot of memory-hungry applications so that there is no contiguous space left to load the next program in, then the OS may issue a warning and give the user the option to close some of the applications or run a garbage collector that moves some memory blocks around. The programmers will learn that if they want top performance they should exercise some discipline in memory allocation. I believe that most applications will be able to run without memory fragmentation.


Hubert Lamontagne wrote:
A lot of branch instructions that your CPU pipeline will see are in loops and generally branch the same way a lot of times in a row, and would actually get slower if you ran both sides, because rarely executed paths would be competing for resources. Here's a random C++ example
The branch predictor will know if a branch is easily predictable. It could select the branches with the worst prediction rate for executing both sides speculatively. In your example, an optimizing compiler may recognize that the inner branch is loop invariant and move it outside the loop, giving you one loop in each branch. The branch predictor should be able to get perfect prediction for a loop with constant count.
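For illustration, the unswitched form of Hubert's loop would look roughly like the sketch below. The variable names come from his fragment and are wrapped in a hypothetical function here; strictly, the compiler would also have to prove (or re-check) that mVolumeFade cannot become exactly zero mid-loop, since the taken path modifies it.

Code: Select all

// Loop-invariant branch hoisted out: each resulting loop body is branch-free
// apart from the loop condition, which a predictor handles almost perfectly.
void filter_block(float* buffer, float& mFilterMem, float mFilterParam,
                  float& mVolumeFade, float mVolume, float mFadeRate) {
    if (mVolumeFade) {
        for (int i = 0; i < 200; i++) {
            float sample = buffer[i];
            mFilterMem += (sample - mFilterMem) * mFilterParam;
            sample = mFilterMem * mVolume;   // fading path
            mVolumeFade += mFadeRate;
            buffer[i] = sample;
        }
    } else {
        for (int i = 0; i < 200; i++) {
            float sample = buffer[i];
            mFilterMem += (sample - mFilterMem) * mFilterParam;
            buffer[i] = mFilterMem;          // non-fading path
        }
    }
}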


Kurt Baumgardner wrote:
16Gb memory on the same chip as the CPU
Hubert Lamontagne wrote:
This is an obvious thing to do. I believe that we will see a lot of on-chip RAM some day when technology allows it. It may be SRAM. Small system-on-a-chip microcontrollers already have both SRAM and Flash on the chip. ForwardCom may be implemented in this fashion for some dedicated applications.


Kurt Baumgardner wrote:
I *really* like the idea of programmer-built extensions to the instruction set
You may have an FPGA on the chip. An application should be able to monopolize this FPGA, or at least a part of it. I am trying to avoid microcode completely in ForwardCom.


Author: Jorcy Neto Date: 2016-09-18 13:55
Agner wrote:
Kurt Baumgardner wrote:
16Gb memory on the same chip as the CPU
Hubert Lamontagne wrote:
This is an obvious thing to do. I believe that we will see a lot of on-chip RAM some day when technology allows it. It may be SRAM. Small system-on-a-chip microcontrollers already have both SRAM and Flash on the chip. ForwardCom may be implemented in this fashion for some dedicated applications.
Actually, Intel has already been studying the performance gains and power issues of logic+DRAM stacking since the mid-2000s (see "Die Stacking (3D) Microarchitecture" here: http://www.cs.cmu.edu/afs/cs/Web/People ... lack06.pdf )
