Agner wrote:
My reason for using the same registers for floating point scalars and floating point vectors is that floating point code often contains calls to mathematical function libraries. Such a library can use variable-length vectors for parameters and results. The same library function can be called with a scalar or a vector because a scalar is simply handled as a vector with one element. This will make it easy for the compiler to vectorize the code. ForwardCom has a separate register set for integer scalars. These are useful for pointers, loop counters, array indexes, etc. that don't need vectors, and they are easier to save and restore than vector registers.

This looks quite elegant. The vector length is passed implicitly to the caller without any change to the ABI.
How do you handle conditionals, though? For instance:
Code:
for (int i = 0; i < N; i++) {
    if (a[i] != 0)
        b[i] = sin(a[i]);
}
Author: Agner Date: 2016-08-15 06:17
Sylvain Collange wrote:
Does the ABI define a register as the "current mask" (GPU-like)? Or should the caller compact input vectors to have only contiguous elements, and expand the output vectors (vector processor-like)?

Vector registers v1-v7 can be used as masks. The normal solution for your example would be to calculate the sin function on the whole a vector and then merge the result, using a mask for (a[i] != 0).
If you want to mask off some elements throughout the calculation of the sin function then the library function would need an extra parameter for the mask. (Some Intel math libraries actually have this.) You wouldn't save any time by masking off elements during the calculation. The only reason I can see to have a mask on library functions is to avoid interrupts (traps) if floating point exceptions are enabled. The sin function wouldn't generate exceptions, but the logarithm function would in your case. The recommendation is to detect floating point errors as NaNs and INFs rather than using exceptions, because exceptions depend on the vector length. If you replace sin by log in your example then the log function would generate INF or NaN for non-positive values, and these values will be removed by the mask operation after the function call. If you insist on having floating point exceptions turned on, for whatever reason, then you might want a version of the log function with a mask, or you could change the non-positive values before the call to the log function.
Compacting the vector, as you propose, might be useful if the vector is very sparse (only a few of the elements are used) and you could compact several vectors into one before calling the function. However, compacting vector elements takes time, so this would be useful only for very time-consuming functions.
Author: Sylvain Collange Date: 2016-08-15 07:35
Thanks. I was considering the more general case of vectorizing arbitrary functions, including functions with side effects. There is no reason to restrict function-level vectorization to a few hand-written math functions, is there?
As the compiler does not know in general whether a particular function has side effects or can raise exceptions, my point is that the mask could be part of the ABI, just as vector length (implicitly) is. This would enable vectorization-agnostic composition of function calls.
By the way, have you considered the micro-architecture and compiler implications of using a subset of vector registers as masks? It seems to introduce quite a lot of complexity:
- Instructions take up to 5 input operands from vector registers (RS, RT, RU, Mask, RD). The old register value RD is required to merge the result.
Supporting 5-input instructions in a wide out-of-order superscalar is out of the question with the current state of technology: executing 3 masked vector instructions/clock would require 15 read ports in the SIMD register file, plus the ports needed for the load/store unit. Worse, making the operation dependent on the former value of the architectural destination register defeats the purpose of register renaming: write-after-read dependencies are still there.
A more acceptable implementation would break down each masked instruction into 2 µops: first a computation on the whole vector, then a merge (I believe that is how Knights Landing and Skylake handle masked AVX-512 instructions). This has a serious performance impact, and still does not completely eliminate write-after-read dependencies.
- The dependency with the old destination value and the merging operation could be avoided when unused lanes are set to 0 instead of the old value. Unfortunately, that information is encoded in the contents of the mask register itself, so the hardware does not know it early enough.
- More generally, masks encode control flow information, while vector registers encode data flow information. They are not needed at the same stages in the pipeline. Execution logic does not need the mask (except as an optimization for saving power), and control logic does not need the data. Encoding control flow information and data flow information in the same registers makes decoupling these stages much harder.
- From the compiler perspective, having some instructions that can only access a subset of registers makes register allocation much harder.
Author: Agner Date: 2016-08-15 10:53
Sylvain Collange wrote:
my point is that the mask could be part of the ABI, just as vector length (implicitly) is. This would enable vectorization-agnostic composition of function calls.

A function may have an unlimited number of input parameters and an unlimited number of outputs (through pointers or references or class members or whatever), all of which could in principle have different masks. That would add a lot of complexity to the ABI for something that is rarely needed. I prefer to specify a mask as an explicit function parameter in the rare cases that it is needed.
Sylvain Collange wrote:
By the way, have you considered the micro-architecture and compiler implications of using a subset of vector registers as masks?

A lot of instructions in current instruction sets use certain registers for specific purposes. I don't think this is a big problem. It is a much bigger problem to have a separate set of registers for masks only, as AVX-512 has, in my opinion.
Sylvain Collange wrote:
Instructions take up to 5 input operands from vector registers (RS, RT, RU, Mask, RD). The old register value RD is required to merge the result.

No, the first source operand is required to merge the result when masking. This happens to be the same as RD only if using an instruction format with too few register operands. This gives a maximum of 5 input dependencies. These are not required at the same stage in the pipeline. The execution stage typically has 2 or 3 input operands. A few unimportant optional instructions have 4 inputs. I added these instructions mainly for testing whether it is feasible to have four input operands.
The address calculation stage in the pipeline can have up to three input dependencies: base pointer, index and vector length (or block size). The architecture does not specify which stage in the pipeline reads the mask register. It could be somewhere between the address calculation stage and the execution stage if the number of register read ports is critical.
So, the total maximum number of input dependencies is five, but the maximum number of input dependencies at one stage in the pipeline can be limited to three if necessary. This is still a lot of complexity to handle for the scheduler, I agree. I have limited the number of output dependencies to one for the same reason. This limitation creates problems for add-with-carry and for integer overflow detection, but people have convinced me that it was more important to limit the number of output dependencies. x86 has two output dependencies on most instructions, including the flags and the floating point control/status registers.
Intel processors have traditionally had a limitation of two input dependencies for each micro-operation, but this has been increased to three in recent versions. AMD have never had any limitation on input dependencies. I don't know how expensive it is to have many read ports, but I prefer to have many read ports rather than splitting instructions into micro-operations. My idea is that a ForwardCom processor should not need to split instructions into micro-ops.
Sylvain Collange wrote:
Making the operation dependent on the former value of the architectural destination register defeats the purpose of register renaming: write-after-read dependencies are still there.

All superscalar processors handle this by renaming the register so that the input and output use two different physical registers.
Sylvain Collange wrote:
A more acceptable implementation would break down each masked instruction into 2 µops: first a computation on the whole vector, then a merge (I believe that is how Knights Landing and Skylake handle masked AVX-512 instructions).

I have not been able to get my hands on any AVX-512 processor yet, so I am not able to test it. If you have any information on the microarchitecture of these processors then please give me a link.
Sylvain Collange wrote:
More generally, masks encode control flow information, while vector registers encode data flow information.

The masking mechanism translates control flow into data flow. It does not change the dependency of the other input and output operands, as I explained above. If all elements of a vector are masked off then the hardware will still treat the input operand as a dependency and wait for it. You may want to convert this into a branch if the prediction rate is good.
Author: Sylvain Collange Date: 2016-08-15 14:05
Agner wrote:
A function may have an unlimited number of input parameters and an unlimited number of outputs (through pointers or references or class members or whatever), all of which could in principle have different masks. That would add a lot of complexity to the ABI for something that is rarely needed. I prefer to specify a mask as an explicit function parameter in the rare cases that it is needed.

I do not see in which situation a vectorized program would need to pass multiple masks to a function. If we vectorize a loop with a function call inside, then for each iteration i, either the function is called (mask bit 1) or not called (mask bit 0). Masks originate from the control flow of the original program. There is only one control flow regardless of the number of function parameters, so a single mask register is sufficient.
Agner wrote:
So, the total maximum number of input dependencies is five, but the maximum number of input dependencies at one stage in the pipeline can be limited to three if necessary.

The pipeline stage does not make a difference. If you want to sustain an execution rate of three 5-input instructions per clock cycle, you need the bandwidth to read 15 inputs per cycle on average from the register file. It does not matter in which order and at which pipeline stage they are read.
I agree that limiting the number of outputs is even more important. (Register file area scales with the square of the number of ports and power scales with its cube. But for read ports, you can always follow the brute-force approach of duplicating the register file and get a linear scaling.)
Yes, x86 instructions also output flags, but x86 CPUs use tricks to turn them into single-output instructions by having slightly wider physical registers, storing the flags in the physical destination register, and renaming the architectural flags register to the physical destination of the last instruction issued.
But input ports are also a bottleneck. Specializing registers allows this complexity to be partitioned. Most ISAs use different registers for integer and FP, for instance.
Agner wrote:
All superscalar processors handle this by renaming the register so that the input and output use two different physical registers.

Let's take an example:
Non-masked code
mul_add.d v0, v1, v3, v4
mul_add.d v0, v5, v6, v7
Renaming maps each v0 operand to different physical registers. Assuming all inputs are ready and we have 2 FMA units with latency of 5 cycles, both instructions run in parallel and the code takes 5 cycles. So far so good.
Masked code
mul_add.d v0, v1, v3, v4, mask=v8
mul_add.d v0, v5, v6, v7, mask=v9
Renaming still maps v0 to different physical registers p0 and p1, but the second FMA now has a data dependency on the older value of v0, which is p0. Assuming a by-the-book OoO scheduler, instructions will be considered as dependent and the result will be only available after 10 cycles.
Now with the 2 µop option, the code becomes after expansion and renaming:
mul_add.d p0, v1, v3, v4
merge.d p1, v0, p0, mask=v8
mul_add.d p2, v5, v6, v7
merge.d p3, p1, p2, mask=v9
Both FMAs can start immediately and produce their results after 5 cycles. Then the first merge executes in say, 1 cycle, then the second dependent merge. Total latency is 7 cycles.
In any case, renaming was unable to completely reorder the two instructions.
I agree you could have a very smart scheduler that schedules instructions just 5 cycles before the mask becomes available so it is ready just in time. But it would add a lot of complexity in a critical part of the pipeline. Better to break the work down into enough simple µops and let the out-of-order engine do the heavy lifting. Then you can spend the resources you just saved on wider issue width and higher clocks.
Agner wrote:
I have not been able to get my hands on any AVX-512 processor yet, so I am not able to test it. If you have any information on the microarchitecture of these processors then please give me a link.

At some point Intel had the Knights Landing software optimization guide on their website, but it seems they took it out.
About masks, they say:
AVX-512 permits instructions to use masking. The best performance usually happens for instructions that do not use masking. Instructions that mask with zeroing usually have similar performance. Instructions that mask with merging can occasionally perform much slower, since there is an additional dependency introduced to the instruction. Keep in mind performance considerations when selecting the masking mode.

Their formulation suggests they follow the single-instruction approach with naive scheduling. They do not appear to split into two µops after all. But if they write "much slower" they probably mean it...
The takeaway is that the masking mode (zeroing or merging) should be available as early as possible in the pipeline, ideally at decode just by looking at the instruction. And the programmer/compiler should use zeroing whenever possible.
Author: Agner Date: 2016-08-15 23:46
Sylvain Collange wrote:
The pipeline stage does not make a difference. If you want to sustain an execution rate of three 5-input instructions per clock cycle, you need the bandwidth to read 15 inputs per cycle on average from the register file. It does not matter in which order and at which pipeline stage they are read.

Could you have two separate sets of register read ports, one that is wired to the address calculation stage in the pipeline (general purpose registers only), and one that is wired to the execution stage?
Sylvain Collange wrote:
Yes, x86 instructions also output flags, but x86 CPUs use tricks to turn them into single-output instructions by having slightly wider physical registers, storing the flags in the physical destination register, and renaming the architectural flags register to the physical destination of the last instruction issued.

Interesting. I wonder where you have all this information from?
Sylvain Collange wrote:
Masked code
mul_add.d v0, v1, v3, v4, mask=v8
mul_add.d v0, v5, v6, v7, mask=v9
Renaming still maps v0 to different physical registers p0 and p1, but the second FMA now has a data dependency on the older value of v0, which is p0.

No. Without zeroing, the mask chooses between the result and the first input operand, which is not the destination operand.
mul_add.d v0, v1, v3, v4, mask=v5
is the equivalent of:
Code:
for (i = 0; i < vectorlength; i++)
    v0[i] = v5[i] ? v1[i] + v3[i]*v4[i] : v1[i];
Sylvain Collange wrote:
Their formulation suggests they follow the single-instruction approach with naive scheduling. They do not appear to split into two µops after all. But if they write "much slower" they probably mean it... The takeaway is that the masking mode (zeroing or merging) should be available as early as possible in the pipeline, ideally at decode just by looking at the instruction.

You may as well interpret "much slower" as a consequence of the extra dependency which it has to wait for. I actually had the masking mode (zeroing or merging) coded in a bit in the instruction in an early draft of ForwardCom, but I needed this bit for other purposes, and I was able to put it into the mask register because it doesn't change dependencies when merging with the first operand rather than with the destination.
Author: Agner Date: 2016-08-16 03:11
Actually, we could limit the maximum number of input dependencies to four if we make the rule that instructions with three input operands including a memory operand cannot have a mask. That would not be a serious limitation.
Author: Sylvain Collange Date: 2016-08-16 10:30
Correction: I meant WAW dependencies in my previous messages, not WAR dependencies.
Agner wrote:
Could you have two separate sets of register read ports, one that is wired to the address calculation stage in the pipeline (general purpose registers only), and one that is wired to the execution stage?

If both sets of ports access the same physical array of registers, then the RF component still needs m+n ports, regardless of where its ports are connected to.
If you replicate the array itself to make two copies of the register file, then yes, it works: you get twice as many read ports for about twice the area. It does not work for write ports, though: you still need to perform all writebacks to both register files to keep their data coherent. Then another complication is the bypass network, which does not scale well with the number of ports either.
You may want to take advantage of the fact that mask registers are a subset of general-purpose registers to replicate only part of the array. It would work on an in-order processor, but as soon as you rename registers this property does not hold any more.
Agner wrote:
Interesting. I wonder where you have all this information from?

Unfortunately, academic papers seldom go down to that level of detail, so you have to microbenchmark, speculate, and read patents.
For register renaming, here are the classic ones from the P6/K7 era:
https://www.google.com/patents/US6047369
https://www.google.com/patents/US5632023
Even on these 20 year-old architectures, you can see the renaming techniques are quite involved. (Thanks to the infamous INC/DEC instructions that only update a subset of the Flags register...)
Agner wrote:
No. Without zeroing, the mask chooses between the result and the first input operand, which is not the destination operand.

Now I get it, thanks! Indeed, you need no extra input since you will be reading this register anyway.
I was expecting the semantics of masked instructions to be "pretend nothing happens on lanes with a 0 bit mask". In your ISA even 0-masked lanes have the side effect of a register move. This is unusual, but why not... Wouldn't that complicate register allocation, though? Do you have code examples showing how a compiler can perform if-conversion from some arbitrary code?
Also, are there instruction encodings to choose between the result and the second operand (e.g. v0 = v5 ? v1 + v3 * v4 : v3; for a conditional Horner step)?
Author: Agner Date: 2016-08-17 07:33
Sylvain Collange wrote:
Correction: I meant WAW dependencies in my previous messages, not WAR dependencies.

If both sets of ports access the same physical array of registers, then the RF component still needs m+n ports, regardless of where its ports are connected to. If you replicate the array itself to make two copies of the register file, then yes, it works.

I don't want to duplicate the register file, only avoid the quadratic increase you are talking about.
Sylvain Collange wrote:
Then another complication is the bypass network, which does not scale well with the number of ports either.

Can you bypass a result to multiple in-flight instructions using the same bus? If so, you need only a few bypass buses.
Sylvain Collange wrote:
Unfortunately, academic papers seldom go down to that level of detail, so you have to microbenchmark, speculate, and read patents.

Do you have any references on how the register read ports are designed and why they are so expensive?
Sylvain Collange wrote:
Even on these 20 year-old architectures, you can see the renaming techniques are quite involved. (Thanks to the infamous INC/DEC instructions that only update a subset of the Flags register...)

This is exactly the kind of cumbersome heritage that made me want to design a completely new instruction set. ForwardCom never modifies only a part of a register.
Sylvain Collange wrote:
I was expecting the semantics of masked instructions to be "pretend nothing happens on lanes with a 0 bit mask".

That was never my plan. The mask value is not available at the time the register renamer assigns the output.
Sylvain Collange wrote:
Do you have code examples showing how a compiler can perform if-conversion from some arbitrary code?

A branch in a vectorized loop is translated to a mask. All the good compilers can do this today.
Sylvain Collange wrote:
Also, are there instruction encodings to choose between the result and the second operand (e.g. v0 = v5 ? v1 + v3 * v4 : v3; for a conditional Horner step)?

That was not my plan, but of course such an instruction could be defined if there is a need for it. But why would you have a conditional inside a polynomial?