An alternative to two register outputs for add-with-carry is to have extra carry bits in the register. We could, for example, have one extra flag bit for each 32 bits of vector registers. The extra bit can be used for carry, overflow, masks, etc. In this way we can implement add-with-carry with just two inputs and one output. Nobody needs add-with-carry for 8-bit and 16-bit integers. Maybe it is worth the cost of having 3% more bits of vector storage if we can avoid instructions with two outputs? But now we have a problem with saving the vector to memory. We will need two different save instructions - with and without the extra flag bits. This will cause complications in the compiler. And, do we need an extra bit on integer registers as well for the sake of orthogonality?
We can live without pop instructions. If we can handle the problems of an extra flag bit, then the only need for instructions with two outputs is read-and-increment-pointer. Without this, how do we save all registers in a task switch in an efficient and compact way? The extra bits will make it still more complicated to save registers for task switch. Any suggestions?
BTW, if we can accept extra bits in a vector register, we can also have extra bits for vector length. This will remove the input dependency on a vector-length register.
Author: Hubert Lamontagne Date: 2016-04-01 02:46
I've thought about the "assigning a flag register for each register" thing a bit, and its problem is not only that saving/restoring by the OS on interrupts, but also that it makes all callee-saved registers in the calling convention double-expensive - because the leaf function not only has to save the register value, but also its matching flags. You get a conflict between the principle that "flags aren't saved on function calls" (because they get wiped too easily and you pretty much never want to preserve them anyways) and the principle that "flags should have the same calling convention as their matching registers" (since they presumably come from the same register rename and update together). It also does all sorts of weird things like change 'nops' into different operations (mov r0, r0; add r0, #0; and r0, #-1; and the st r0, [sp+0] + ld r0, [sp+0] sequence all become different due to flag effects). Mov preserves flags (so that you can make 'mov' happen 'for free' by doing it at the register rename stage) but mov immediate doesn't. This is bad for late compiler optimization passes (because it has to take flags into account). So I think having flags be a part of value registers creates more problems than it solves.
If Bignums are really important and we need fast ADC for stuff like RSA encryption, I suggest we should make ADC/SBC vector-unit only, and have specific flags registers attached to the vector unit (to reduce the number of dependency paths from the SIMD unit to the main unit). Also, I'd separate the flags that are a result of the SIMD operations (ie carry) from the flags that control SIMD operations (zeroing, denormal-zeroing, etc), so that SIMD operations that update flags can simply wipe the previous value - Partial flag register updates are bad since it requires separate flag rename engine for every part that doesn't update together!
The exception to this is vector length (for variable vector length), which has to be on the integer-scalar unit because it has to be known by load/store instructions.
Ideally, it's best if SIMD instructions cannot cause interrupts and can't affect program flow. For most architectures, SIMD operations execute on a different unit with a different issue queue, so it's best if non-SIMD and SIMD operations can be separated as fast and easily as possible - basically right after determining instruction size, since operations go to different queues and compete for different register ports and so forth. In theory, you could even design a cpu with a separate IP and instruction cache for the SIMD unit, and do all SIMD loads/stores through a queue (the PS3's Cell is sorta like this, in a way).
For instance, on the ARM Cortex A8, NEON/FPU instructions literally CANNOT cause interrupts since the non-SIMD instruction results don't even commit at the same time, so SIMD instructions and every subsequent instruction have to be 100% sure to run (because the result of subsequent non-SIMD has already been committed so it cannot be undone). The benefit of this is that the non-SIMD commit unit doesn't even have to know that the SIMD unit even exists except for receiving store values in a queue, and the SIMD unit likewise knows nothing about the non-SIMD unit except that instructions and values loaded from memory and forwarded from GPRs arrive in a queue and that store values go in a queue.
X86 enforces this less strongly (so that committing has to be synchronized between the general-purpose unit and the SIMD unit) but even then, there's a reason why, on the Athlon, COMISS (SSE float compare and update main unit flags register) runs on the Vector Path (microcode!) - and it's not because AMD engineers thought people wouldn't use COMISS.
Basically, I don't think orthogonality between non-SIMD instructions and SIMD instructions is a good idea, since they have different goals: non-SIMD instructions have to have as few weird side effects as possible and retire as fast as possible, so that they can be renamed and reordered and jumbled and rolled-back if prediction has failed (or a load/store caused a page fault). SIMD instructions just have to do as much math as possible per cycle - they're all about throughput, so it doesn't matter if they take 4 or 5 cycles to complete, if they can't be reordered and so forth - which is why VLIW is popular in cpus designed for DSP (they don't have to run C++!).
SIMD-oriented code also tends to be more likely to be simple short loops so I don't think it really has to be particularly compact. Also the memory bandwidth usage for instructions will probably be totally dwarfed by the memory bandwidth usage for data in SIMD code anyways.
I don't think saving/restoring the register file on task switch using 32 consecutive loads/stores (+sp update) is THAT much of a problem because task switches cause other much slower side effects - for instance, you're likely to get a whole bunch of instruction cache misses and data cache misses and TLB evictions and TLB misses and branch prediction misses and the cache prefetcher getting confused - those are many times more costly.
To handle interrupts, you do need a few scratchpad registers that are only accessible to the OS, for saving the previous values of SP + IP + OS/user mode + a couple extra registers. This is to get work space to "bootstrap" the interrupt handler state save/restore. Early MIPS had the problem that it didn't really have those system-reserved special registers, so you unfortunately lost a couple of general purpose registers instead.
You also probably need the hardware TLB to have different memory mappings for OS and user and switch automatically between those. Another way to deal with this is having a few banked registers (typically SP and the last few registers - just enough to initiate state saving). Even though this makes interrupt handler prologues kinda ugly, it also removes the need for microcoded interrupt handlers (which are often somewhat bypassed by the OS anyways).
Author: Agner Date: 2016-04-01 03:49
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.
I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: You only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache. If a very long vector register actually contains only a floating point scalar then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.
The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags. You want to keep the carry bits separate from the control flags. I agree. My idea is to use the carry bits for overflow detection (so that we can avoid interrupts in vector instructions), and use a flags register, as already present in my initial proposal, for masking and for a lot of option bits that we don't have room for in the instruction code. A few of these option bits will determine the use of the carry flag, i.e. whether it is unused or it detects signed or unsigned integer overflow, floating point overflow or invalid operations or both, and whether it is propagated from input vector operands to output operands. All of these features are only activated if the software actually has a try/catch clause to detect overflows. Otherwise, the compiler doesn't need to care. But you want to avoid instructions with two outputs, and this is the best solution to that problem I can think of We can separate the carry/overflow bits from the control flags without generating any extra output dependence.
We can have an instruction to extract the carry bits to an integer bitfield. This will make it possible to do a fast carry-lookahead for a BIGNUM addition in a large vector register with just a few integer instructions.
I agree that integer registers should not have these extra bits.
The idea of having separate registers for system code only, sounds good. Would eight 64-bit registers be a suitable number?
I have fiddled a little with the instruction format. The coding gets simpler when we get rid of the vector length register and a few option bits. I think this makes it possible to make the instructions shorter so that we can avoid triple-size instructions. Of course, we will have to use two instructions for loading a 64-bit immediate value then, but I think we can live with that.
Author: Joe Duarte Date: 2016-04-02 02:09
Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.
In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?
We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process' internal small memory addresses), and it would often take only two or three bytes total.
You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.
Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines,. NET and Java VMs, etc. what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use then (I couldn't find them in nginx or Apache source, or browsers I checked, and they could really benefit.) That raises a separate issue in getting developers to pay attention to ISAs and optimization wins...
Author: Agner Date: 2016-04-02 03:03
Joe Duarte wrote:
Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.why do we still use conventional -- and enormous -- memory addresses?
We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two.
I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines,. NET and Java VMs, etc. what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture?
My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?
Interpreters for byte code languages have a lot of jump tables or tables of function pointers. My proposal has an instruction for efficient handling of jump/call tables of 32-bit pointers relative to an arbitrary reference point.
The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use then
Author: Joe Duarte Date: 2016-04-02 17:09
Am I correct in recalling that x86-64 doesn't actually expose a 64-bit address space, but rather a 48-bit one? See http://stackoverflow.com/questions/6716 ... ress-spaceMost operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.
However, this doesn't matter for my purposes. I'm asking why, in virtual memory, we mirror the physical memory addressing scheme. Why does a process use an address like 0x7fffd1510060 when it could use and address like 1 or 2 or D4? It's the process' exclusive virtual memory space – wouldn't it save a lot of memory if it could use one-byte pointers? I image that the TLB or MMU can translate these virtual addresses just as easily as it translates 0x7fffd1510060.
I've also wondered if we could use time stamps as virtual memory addresses – start the clock at zero nanoseconds for each process, every allocation marked by the nanoseconds transpired since the start of the process, each process having its own little Unix epoch if you will. This would also be more compact than the status quo, given some simple truncation and compression techniques. Time-stamped allocations might also be useful for a capabilities-based system, like the CHERI ISA at Cambridge or the Barrelfish OS that Microsoft and ETH Zurich have worked on.
The other big thing JITs and many non-JIT runtimes have to do is garbage collection. I think it's well worth thinking about how an ISA could be designed to optimize garbage collection. There are some papers out there on hardware-accelerated garbage collection, but I haven't seen anyone model how an ISA's design decisions could help (or hurt) garbage collection.
Now that I think about it, you really ought to be involved in that project. They would benefit from your input. Relatedly, the web standards authorities and browser makers have been working on SIMD.JS, which I think would also benefit from your insights. I'm surprised they haven't asked for your help (if in fact they haven't). https://developer.mozilla.org/en-US/doc ... jects/SIMD
I don't think text length is the issue. The new instructions are mostly designed to parse XML (and would work just as well for parsing any kind of structured text, HTML, even bytecode depending on some particulars.) From one of the Intel papers:The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.
(From https://software.intel.com/en-us/articl ... intel-sse4)XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a serious of character range rules. The main work of Intel XML parsing is to recognize these entities and their logic structures.
From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding entity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.
I think the instructions would be useful for superfast user agent detection in web servers. I think PCMPESTRI and the other instructions work with 16-byte strings, and you could probably take a 16-byte chunk of a certain area of the user agent string that would uniquely identify the key factors you cared about across all user agents, like mobile or not, specific browser and version (which could, for example, tell you if you could use the vectors in SIMD.JS because you'd know which browsers support it.) The web is too slow, and I think common web servers, applications, and databases would be much faster if they used modern CPU instructions in their code.
(Example user agent string: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36)
Author: Agner Date: 2016-04-02 04:26
I have found a way to do addition of very large integers without the need for special carry bits:
- Put the two big numbers into vector registers A and B, or as much of the numbers that will fit into the maximum vector length
- Calculate the sums of all 64-bit vector elements: C = A + B
- Make a vector of the carries, one bit in each 64-bit element: D = (C < A)
- Find elements with all ones: E = (C == -1)
- Combine bit 0 of each element of C into an integer bitfield: D1 = convert_boolean_vector_to_bitfield(D)
- Do the same with E: E1 = convert_boolean_vector_to_bitfield(E)
- Use integer operations to do the carry look-ahead. D1 is the elements that generate carry, E1 is the elements that propagate carry. F = E1 xor (E1 + 2D1 + carry_in)
- Convert back from bitfield to vector: F1 = convert_bitfield_to_boolean_vector(F)
- Add the propagated carries: Sum = C + F1
- Carry out if more rounds needed: Carry_out = F >> number_of_vector_elements
If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.
So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method was available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps for integer overflow exception is not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.