The original proposal, 2 years ago

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Do we need instructions with two outputs?

Post by agner » Thu Nov 02, 2017 9:35 am

Author: Agner Date: 2016-03-31 01:55
An alternative to two register outputs for add-with-carry is to have extra carry bits in the register. We could, for example, have one extra flag bit for each 32 bits of vector registers. The extra bit can be used for carry, overflow, masks, etc. In this way we can implement add-with-carry with just two inputs and one output. Nobody needs add-with-carry for 8-bit and 16-bit integers. Maybe it is worth the cost of having 3% more bits of vector storage if we can avoid instructions with two outputs? But now we have a problem with saving the vector to memory. We will need two different save instructions - with and without the extra flag bits. This will cause complications in the compiler. And, do we need an extra bit on integer registers as well for the sake of orthogonality?

We can live without pop instructions. If we can handle the problems of an extra flag bit, then the only need for instructions with two outputs is read-and-increment-pointer. Without this, how do we save all registers in a task switch in an efficient and compact way? The extra bits will make it still more complicated to save registers for task switch. Any suggestions?

BTW, if we can accept extra bits in a vector register, we can also have extra bits for vector length. This will remove the input dependency on a vector-length register.


Author: Hubert Lamontagne Date: 2016-04-01 02:46
I've thought about the "assigning a flag register for each register" thing a bit, and its problem is not only the extra saving/restoring by the OS on interrupts, but also that it makes all callee-saved registers in the calling convention double-expensive - because the leaf function not only has to save the register value, but also its matching flags. You get a conflict between the principle that "flags aren't saved on function calls" (because they get wiped too easily and you pretty much never want to preserve them anyways) and the principle that "flags should have the same calling convention as their matching registers" (since they presumably come from the same register rename and update together). It also does all sorts of weird things like change 'nops' into different operations (mov r0, r0; add r0, #0; and r0, #-1; and the st r0, [sp+0] + ld r0, [sp+0] sequence all become different due to flag effects). Mov preserves flags (so that you can make 'mov' happen 'for free' by doing it at the register rename stage) but mov immediate doesn't. This is bad for late compiler optimization passes (because they have to take flags into account). So I think having flags be a part of value registers creates more problems than it solves.

If Bignums are really important and we need fast ADC for stuff like RSA encryption, I suggest we should make ADC/SBC vector-unit only, and have specific flags registers attached to the vector unit (to reduce the number of dependency paths from the SIMD unit to the main unit). Also, I'd separate the flags that are a result of the SIMD operations (ie carry) from the flags that control SIMD operations (zeroing, denormal-zeroing, etc), so that SIMD operations that update flags can simply wipe the previous value - Partial flag register updates are bad since they require a separate flag rename engine for every part that doesn't update together!

The exception to this is vector length (for variable vector length), which has to be on the integer-scalar unit because it has to be known by load/store instructions.

-

Ideally, it's best if SIMD instructions cannot cause interrupts and can't affect program flow. For most architectures, SIMD operations execute on a different unit with a different issue queue, so it's best if non-SIMD and SIMD operations can be separated as fast and easily as possible - basically right after determining instruction size, since operations go to different queues and compete for different register ports and so forth. In theory, you could even design a cpu with a separate IP and instruction cache for the SIMD unit, and do all SIMD loads/stores through a queue (the PS3's Cell is sorta like this, in a way).

For instance, on the ARM Cortex A8, NEON/FPU instructions literally CANNOT cause interrupts since the non-SIMD instruction results don't even commit at the same time, so SIMD instructions and every subsequent instruction have to be 100% sure to run (because the results of subsequent non-SIMD instructions have already been committed and cannot be undone). The benefit of this is that the non-SIMD commit unit doesn't even have to know that the SIMD unit even exists except for receiving store values in a queue, and the SIMD unit likewise knows nothing about the non-SIMD unit except that instructions and values loaded from memory and forwarded from GPRs arrive in a queue and that store values go in a queue.

X86 enforces this less strongly (so that committing has to be synchronized between the general-purpose unit and the SIMD unit) but even then, there's a reason why, on the Athlon, COMISS (SSE float compare and update main unit flags register) runs on the Vector Path (microcode!) - and it's not because AMD engineers thought people wouldn't use COMISS.

Basically, I don't think orthogonality between non-SIMD instructions and SIMD instructions is a good idea, since they have different goals: non-SIMD instructions have to have as few weird side effects as possible and retire as fast as possible, so that they can be renamed and reordered and jumbled and rolled-back if prediction has failed (or a load/store caused a page fault). SIMD instructions just have to do as much math as possible per cycle - they're all about throughput, so it doesn't matter if they take 4 or 5 cycles to complete, if they can't be reordered and so forth - which is why VLIW is popular in cpus designed for DSP (they don't have to run C++!).

SIMD-oriented code also tends to consist of simple short loops, so I don't think it really has to be particularly compact. Also the memory bandwidth usage for instructions will probably be totally dwarfed by the memory bandwidth usage for data in SIMD code anyways.

-

I don't think saving/restoring the register file on task switch using 32 consecutive loads/stores (+sp update) is THAT much of a problem because task switches cause other much slower side effects - for instance, you're likely to get a whole bunch of instruction cache misses and data cache misses and TLB evictions and TLB misses and branch prediction misses and the cache prefetcher getting confused - those are many times more costly.

To handle interrupts, you do need a few scratchpad registers that are only accessible to the OS, for saving the previous values of SP + IP + OS/user mode + a couple extra registers. This is to get work space to "bootstrap" the interrupt handler state save/restore. Early MIPS had the problem that it didn't really have those system-reserved special registers, so you unfortunately lost a couple of general purpose registers instead.

You also probably need the hardware TLB to have different memory mappings for OS and user and switch automatically between those. Another way to deal with this is having a few banked registers (typically SP and the last few registers - just enough to initiate state saving). Even though this makes interrupt handler prologues kinda ugly, it also removes the need for microcoded interrupt handlers (which are often somewhat bypassed by the OS anyways).


Author: Agner Date: 2016-04-01 03:49
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.

I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: You only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache. If a very long vector register actually contains only a floating point scalar then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.

The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags. You want to keep the carry bits separate from the control flags. I agree. My idea is to use the carry bits for overflow detection (so that we can avoid interrupts in vector instructions), and use a flags register, as already present in my initial proposal, for masking and for a lot of option bits that we don't have room for in the instruction code. A few of these option bits will determine the use of the carry flag, i.e. whether it is unused or it detects signed or unsigned integer overflow, floating point overflow or invalid operations or both, and whether it is propagated from input vector operands to output operands. All of these features are only activated if the software actually has a try/catch clause to detect overflows. Otherwise, the compiler doesn't need to care. But you want to avoid instructions with two outputs, and this is the best solution to that problem I can think of :-) We can separate the carry/overflow bits from the control flags without generating any extra output dependence.

We can have an instruction to extract the carry bits to an integer bitfield. This will make it possible to do a fast carry-lookahead for a BIGNUM addition in a large vector register with just a few integer instructions.

I agree that integer registers should not have these extra bits.

The idea of having separate registers for system code only, sounds good. Would eight 64-bit registers be a suitable number?

I have fiddled a little with the instruction format. The coding gets simpler when we get rid of the vector length register and a few option bits. I think this makes it possible to make the instructions shorter so that we can avoid triple-size instructions. Of course, we will have to use two instructions for loading a 64-bit immediate value then, but I think we can live with that.


Author: Joe Duarte Date: 2016-04-02 02:09
Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.

In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process' internal small memory addresses), and it would often take only two or three bytes total.

You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.

Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc. what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them (I couldn't find them in nginx or Apache source, or browsers I checked, and they could really benefit.) That raises a separate issue in getting developers to pay attention to ISAs and optimization wins...


Author: Agner Date: 2016-04-02 03:03
Joe Duarte wrote:
why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two.
Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.
Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc. what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture?
I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?

My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?

Interpreters for byte code languages have a lot of jump tables or tables of function pointers. My proposal has an instruction for efficient handling of jump/call tables of 32-bit pointers relative to an arbitrary reference point.
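As a rough C++ illustration of the dispatch pattern such an instruction would speed up (the table and symbol names below are made up for the example, not part of the proposal):

Code: Select all

#include <cstdint>

typedef void (*Handler)();

// Hypothetical names: a table of 32-bit offsets and an arbitrary reference point.
extern const int32_t handler_offset[256];
extern const char dispatch_base[];

inline Handler decode(uint8_t opcode) {
    // A table lookup plus an add relative to the reference point; the proposed
    // instruction would handle this relative address calculation efficiently.
    return reinterpret_cast<Handler>(
        reinterpret_cast<uintptr_t>(dispatch_base) + handler_offset[opcode]);
}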
Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them
The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.


Author: Joe Duarte Date: 2016-04-02 17:09
Agner wrote:
Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.
Am I correct in recalling that x86-64 doesn't actually expose a 64-bit address space, but rather a 48-bit one? See http://stackoverflow.com/questions/6716 ... ress-space

However, this doesn't matter for my purposes. I'm asking why, in virtual memory, we mirror the physical memory addressing scheme. Why does a process use an address like 0x7fffd1510060 when it could use an address like 1 or 2 or D4? It's the process' exclusive virtual memory space – wouldn't it save a lot of memory if it could use one-byte pointers? I imagine that the TLB or MMU can translate these virtual addresses just as easily as it translates 0x7fffd1510060.

I've also wondered if we could use time stamps as virtual memory addresses – start the clock at zero nanoseconds for each process, every allocation marked by the nanoseconds transpired since the start of the process, each process having its own little Unix epoch if you will. This would also be more compact than the status quo, given some simple truncation and compression techniques. Time-stamped allocations might also be useful for a capabilities-based system, like the CHERI ISA at Cambridge or the Barrelfish OS that Microsoft and ETH Zurich have worked on.

Agner wrote:
I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?
Parsing, parsing, and more parsing (and lexing). I'm not sure that processor and ISA designers have thoroughly explored how parsing performance might be improved. And of course the actual compilation to machine code. JITs have to make hard trade-offs with respect to generating maximally optimized code vs. generating code quickly. They have to forsake some of the optimizations that a static C/C++ compiler would provide. (Relatedly, you might find this interesting. Apple recently added LLVM as last stage compiler for their WebKit/Safari JavaScript JIT, but more recently replaced it with a completely new compiler. Very interesting deep dive here: https://webkit.org/blog/5852/introducin ... -compiler/)

The other big thing JITs and many non-JIT runtimes have to do is garbage collection. I think it's well worth thinking about how an ISA could be designed to optimize garbage collection. There are some papers out there on hardware-accelerated garbage collection, but I haven't seen anyone model how an ISA's design decisions could help (or hurt) garbage collection.

Agner wrote:
My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?
We need them for the web, for JavaScript. This will be the case for many years to come. And for REPL and interpreter-like execution environments. Julia comes to mind the most.

WebAssembly is rolling out later this year. It looks excellent and thankfully everyone is behind it: Microsoft, Mozilla, Google, and perhaps Apple. It will be a partly compiled bytecode that browsers can execute much, much faster than JavaScript. However it won't replace JavaScript, as it's meant more for games, multimedia, and compute-intensive workloads. For now, the only source languages supported are C and C++, but more are expected. https://github.com/WebAssembly/design

Now that I think about it, you really ought to be involved in that project. They would benefit from your input. Relatedly, the web standards authorities and browser makers have been working on SIMD.JS, which I think would also benefit from your insights. I'm surprised they haven't asked for your help (if in fact they haven't). https://developer.mozilla.org/en-US/doc ... jects/SIMD

Agner wrote:
The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.
I don't think text length is the issue. The new instructions are mostly designed to parse XML (and would work just as well for parsing any kind of structured text, HTML, even bytecode depending on some particulars.) From one of the Intel papers:
XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a series of character range rules.[1] The main work of Intel XML parsing is to recognize these entities and their logic structures.

From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding entity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.
(From https://software.intel.com/en-us/articl ... intel-sse4)

I think the instructions would be useful for superfast user agent detection in web servers. I think PCMPESTRI and the other instructions work with 16-byte strings, and you could probably take a 16-byte chunk of a certain area of the user agent string that would uniquely identify the key factors you cared about across all user agents, like mobile or not, specific browser and version (which could, for example, tell you if you could use the vectors in SIMD.JS because you'd know which browsers support it.) The web is too slow, and I think common web servers, applications, and databases would be much faster if they used modern CPU instructions in their code.

(Example user agent string: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36)
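As a rough sketch of that kind of check with the SSE4.2 intrinsics (hypothetical helper; it assumes both pointers have at least 16 readable bytes):

Code: Select all

#include <nmmintrin.h>  // SSE4.2: PCMPESTRI is exposed as _mm_cmpestri

// Return the index (0..15) of the first byte in a 16-byte chunk of the user
// agent string that matches any byte in the needle set, or 16 if none matches.
int find_any_of(const char* chunk16, const char* needles, int needle_len) {
    __m128i hay = _mm_loadu_si128(reinterpret_cast<const __m128i*>(chunk16));
    __m128i set = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needles));
    return _mm_cmpestri(set, needle_len, hay, 16,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY);
}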

Cheers,

Joe D.


Author: Agner Date: 2016-04-02 04:26
I have found a way to do addition of very large integers without the need for special carry bits:
  • Put the two big numbers into vector registers A and B, or as much of the numbers as will fit into the maximum vector length
  • Calculate the sums of all 64-bit vector elements: C = A + B
  • Make a vector of the carries, one bit in each 64-bit element: D = (C < A)
  • Find elements with all ones: E = (C == -1)
  • Combine bit 0 of each element of C into an integer bitfield: D1 = convert_boolean_vector_to_bitfield(D)
  • Do the same with E: E1 = convert_boolean_vector_to_bitfield(E)
  • Use integer operations to do the carry look-ahead. D1 is the elements that generate carry, E1 is the elements that propagate carry. F = E1 xor (E1 + 2D1 + carry_in)
  • Convert back from bitfield to vector: F1 = convert_bitfield_to_boolean_vector(F)
  • Add the propagated carries: Sum = C + F1
  • Carry out if more rounds needed: Carry_out = F >> number_of_vector_elements
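In plain C++ (not ForwardCom code), simulating the vector elements with an array of 64-bit words, the same steps look like this; the sketch assumes at most 63 elements per block so the generate/propagate bitfields fit in one 64-bit integer:

Code: Select all

#include <cstdint>
#include <cstddef>

// Element i of a, b and sum holds bits 64*i .. 64*i+63 of the big numbers.
// Returns the carry out of the block.
uint64_t bignum_add_block(uint64_t* sum, const uint64_t* a, const uint64_t* b,
                          size_t n, uint64_t carry_in) {
    uint64_t d1 = 0, e1 = 0;                      // generate / propagate bitfields
    for (size_t i = 0; i < n; i++) {
        uint64_t c = a[i] + b[i];                 // C = A + B
        d1 |= (uint64_t)(c < a[i]) << i;          // D: this element generates a carry
        e1 |= (uint64_t)(c == ~0ull) << i;        // E: this element propagates a carry
        sum[i] = c;
    }
    uint64_t f = e1 ^ (e1 + 2 * d1 + carry_in);   // carry look-ahead: bit i = carry into element i
    for (size_t i = 0; i < n; i++)
        sum[i] += (f >> i) & 1;                   // add the propagated carries
    return (f >> n) & 1;                          // carry out for the next block
}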
If we don't need extra carry bits for high precision arithmetics, then the only need for these extra bits is for detecting and propagating integer overflow.

If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.
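A small scalar example of that pattern in C++ (GCC and Clang expose the same check as __builtin_add_overflow; it is written out explicitly here to show the extra instructions):

Code: Select all

#include <cstdint>

// out = a + b + c with overflow tracking: each addition is paired with a check
// on the same operands, and the checks are OR'ed into one flag.
bool add3_checked(int64_t a, int64_t b, int64_t c, int64_t& out) {
    int64_t t = (int64_t)((uint64_t)a + (uint64_t)b);
    bool ovf = ((a ^ t) & (b ^ t)) < 0;        // overflow if a and b agree in sign but t differs
    out = (int64_t)((uint64_t)t + (uint64_t)c);
    ovf |= ((t ^ out) & (c ^ out)) < 0;
    return ovf;
}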

So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method were available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps on integer overflow exceptions are not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Do we need instructions with two outputs?

Post by agner » Thu Nov 02, 2017 9:37 am

Author: Hubert Lamontagne Date: 2016-04-02 14:13
Agner wrote:
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.
Ha yeah. "Sparring partner" is a great way to put it. :3
I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: You only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache. If a very long vector register actually contains only a floating point scalar then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.
One catch is that vector registers need special alignment. For instance,
if your vector regs are 512bits and your DCache width is 512bits, you
want your vectors to be 512bit aligned when saved to the stack, so you
need to 512-align your stack pointer. Also, as the vector register size
increases, the stack size allocated to applications has to grow because
an app that was tested and known to work with 128bit vector saves/restores
might cause stack overflows if vectors become 512bits!

In light of this, I'd suggest:
- Scalar floats should get their own 32/64bit registers. In some C++
programs (games, audio, etc) they see lots of use and are often passed
to/from functions, so they need to be easy to save/restore. Since they have
the same 4/8byte alignment as 32/64bit ints, this is very easy to do if
you mandate an 8byte aligned SP in the calling convention.

- In fact, I'd even suggest a second, different stack for vector register
saves/restore. This way, the CPU can keep the VSP (vector stack pointer)
in perfect alignment all the time without having to do dynamic realignment
on the regular SP, and it can have special instructions that adjust the VSP
the right way for each vector, and the OS can grow the vector stack to make
sure that a program that expects 128-bit vector alignment will never generate
stack overflows when alignment grows to 256-bit+ etc...

I see the whole sequence for saving multiple vectors going something like this:

Code: Select all

SUB sp, 12
; special instruction that gets the number of empty bytes to adjust vsp so that
; the next vector save is aligned and can be done in a single DCache cycle
GETVECTORSTOREPREPAD r0, vsp, v0.size
ST r0, [sp + 0]
ADD vsp, r0
ST v0, [vsp]
GETVECTORSTOREPREPAD r1, vsp, v1.size
ST r1, [sp + 4]
ADD vsp, r1
ST v1, [vsp]
GETVECTORSTOREPREPAD r2, vsp, v2.size
ST r2, [sp + 8]
ADD vsp, r2
ST v2, [vsp]
The whole sequence can probably be streamlined somewhat but I
hope you get what I mean here.

Agner wrote:
The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags. 
There's an extra cost, it's just in a different non-obvious place:
it forces the compiler to figure out if the carry bits are relevant
for each operation in a chain, and if the compiler can't figure it
out it will output less efficient code. Whereas if carry flags are in their
own register, and only written/read by some operations (on ARM,
when SUB generates flags it is called SUBS and is a different
instruction), then the compiler only ever has to worry about carry
flags for instructions that expressly read/write them (ADDS / ADC /
ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just
becomes one extra register pool in the compiler's register allocator.

The use of separate flag registers that only get written to by a limited
subset of operations is also good for CPUs, because then it needs
a much, much less aggressive unit for handling flag register writes/reads
since it only needs to handle 1 write/read per cycle, instead of like
3/4+ like on x86.
The idea of having separate registers for system code only, sounds good. Would eight 64-bit registers be a suitable number?
8 registers for general purpose OS usage sounds good. It can probably
be all combined in an instruction for reading/writing CPU control registers,
like the CR0..CR15, DR0..DR15, MXCSR, MSW, GDTR, IDTR, LDTR, TR,
CS, SS, DS, ES, FS, GS, and floating point control registers on x64.

Some system operations could also be implemented as reads/writes to system
registers: CPUID could be replaced by a bunch of read-only system registers
that give the CPU model for instance.

Having lots of potential system registers available would probably help with
limiting the need to add new system opcodes as new features are added.

Joe Duarte wrote:
Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.

In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process' internal small memory addresses), and it would often take only two or three bytes total.

You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.
ARM and x86 handle this by having a 32-bit mode. :3

You could definitely argue for going with a 32-bit
architecture with a standard 64-bit extension like what
RISC-V does. This makes total sense for an architecture
that's going to be used in embedded devices.

Even MIPS falls in this case: ultimately, it was used in way
more PS2's, set-top boxes and so forth than in servers.
Because of this, its 32bit support was a good thing
(and you can even argue that the infamous branch-delay-slot
was actually GOOD for MIPS).

----

Anyhow, the issue here I guess is that if your app has pointers
smaller than 4-bytes, well:

- 3-bytes can't be aligned to cache lines. If you have an array
of pointers, eventually one of the pointers has to straddle
2 cache lines and that's a much larger penalty than some
wasted DCache due to using 4-byte pointers.

- 3-bytes is only a 16 MB address space, and few languages
want to be limited to something that small. You cannot start
with 3-byte pointer program and then dynamically upgrade
everything to 4-bytes if you ever run out of space. Might as
well make everything 4-byte from the start.

- 2-bytes is only 64 KB. Sure, you could have 2-byte pointers,
then have all sorts of data zone selectors to grow out of that:
this is exactly how 16-bit x86 works, and programs just totally
outgrew this. Even the late generation of DOS games are way
too large for this and do everything in 32bits.

Anyhow, this is also why the page table is 2 levels on 32bit x86: a lot
of apps are only going to use a few hundred kilobytes of ram,
so it makes sense to only have page tables for small sections
of ram. On the top level page table, all the unused 4mb
blocks-of-pages are simply marked as invalid, and extra page
tables are allocated as the program's memory usage grows.
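For reference, a simplified C++ model of that two-level walk (10+10+12 bit split, present bit in bit 0; reading physical memory through plain pointers is of course a simulation shortcut):

Code: Select all

#include <cstdint>

uint32_t translate(uint32_t vaddr, const uint32_t* page_directory) {
    uint32_t dir_idx   = vaddr >> 22;            // top 10 bits: which 4 MB block
    uint32_t table_idx = (vaddr >> 12) & 0x3FF;  // next 10 bits: page within the block
    uint32_t offset    = vaddr & 0xFFF;          // low 12 bits: byte within the page

    uint32_t pde = page_directory[dir_idx];
    if (!(pde & 1)) return 0;                    // unused 4 MB block: no page table allocated

    const uint32_t* page_table = reinterpret_cast<const uint32_t*>(pde & ~0xFFFu);
    uint32_t pte = page_table[table_idx];
    if (!(pte & 1)) return 0;                    // page not present

    return (pte & ~0xFFFu) | offset;             // physical address
}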
Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines,. NET and Java VMs, etc. what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use then (I couldn't find them in nginx or Apache source, or browsers I checked, and they could really benefit.) That raises a separate issue in getting developers to pay attention to ISAs and optimization wins...
Unfortunately, this kind of higher-level stuff typically has all
sorts of special cases and exceptions that make them
impossible to implement in fast circuits.

For instance, JS interpreters often have the trick of having
numbers as 64bit floats, and other types as an invalid
64bit float with a pointer in it. Well, now instead of just
adding two floats together like in C++, you first have to decide
whether one of your floats is actually a string and you have
to trigger string to double conversion.
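A rough C++ sketch of that representation (the tag value here is made up; real engines pick a NaN pattern that ordinary arithmetic can never produce and canonicalize user NaNs):

Code: Select all

#include <cstdint>
#include <cstring>

struct JSValue {
    uint64_t bits;
    // Hypothetical tag: a quiet-NaN bit pattern whose low bits hold a pointer.
    static constexpr uint64_t kPtrTag = 0xFFF8000000000000ull;

    static JSValue number(double d) { JSValue v; std::memcpy(&v.bits, &d, 8); return v; }
    static JSValue object(void* p)  { return JSValue{kPtrTag | reinterpret_cast<uint64_t>(p)}; }
    bool isNumber() const           { return (bits & kPtrTag) != kPtrTag; }
    double asNumber() const         { double d; std::memcpy(&d, &bits, 8); return d; }
};

// The interpreter's '+' has to test the dynamic types before it can do the
// single float add that compiled C++ code would do:
double addValues(JSValue a, JSValue b) {
    if (a.isNumber() && b.isNumber())
        return a.asNumber() + b.asNumber();
    return 0; // slow path: string conversion / valueOf calls, possibly with side effects
}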

Or worse even, in some languages it can be some type with
an overloaded conversion to double, and the conversion code
can call all sorts of functions generating all sorts of side effects
(like printing to the console), which means that the compiler
can't even change the order of two additions to make things
faster because it can introduce new bugs.

String parsing also falls in this kind of case. For instance,
checking the length of a string is not so obvious: often the
interpreter knows the number of bytes of a string, but some of
those bytes are UTF-8 characters, and if you account for special
cases (like ending up with Latin-1 aka CP-1252 text), then there's
often really no alternative to just looping through the string byte
per byte, in which case it's impossible to be faster than x86 / ARM /
MIPS even though those chips have no special support for this.
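For example, even the simple "how many characters" question turns into a byte-per-byte loop like this (C++ sketch that counts UTF-8 code points and ignores the malformed-input and Latin-1 cases that make real interpreters slower still):

Code: Select all

#include <cstddef>

// UTF-8 continuation bytes look like 0b10xxxxxx, so count every byte that is
// NOT a continuation byte.
size_t utf8_code_points(const char* s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++)
        count += (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
    return count;
}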

This is also why so few higher level languages support multiple
threads / multi-processor : there's just so many weird states that
the interpreter can be in, and there are just so many weird possible
side-effects, that you can't let a second core go through that data -
it would just create all sorts of nasty race conditions and so forth.

-----

Java and .NET are in a lesser class of badness, in that a Java
float is just a float and you can add it normally, and you can easily
run Java multi-threaded (and it has a nice way of adding mutexes
to objects). But then again, Java benefits way less from special
architectures - if your Java program can't run fast on x86, then it
can't run fast on anything!

The real cost of Java is the Garbage Collector, which makes it
impossible to avoid garbage-collection pauses, which are often
longer than 100ms, or even occasionally longer than 300ms.
This is perfectly fine for server software (and Java is an enterprise
language so it fits this perfectly), but it makes Java unsuitable for
software that can't really have pauses like games and heavy-duty
user-facing GUI software (Photoshop, C++ IDEs, Word, iPhone
apps etc). This cannot be solved by processor design.


Agner wrote:
If we don't need extra carry bits for high precision arithmetics, then the only need for these extra bits is for detecting and propagating integer overflow.

If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.

So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method was available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps for integer overflow exception is not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.
Dunno, I write C++ sound code, and for code with lots
of multiplications that could overflow, I use floating point
code _all the time_. :3

Floating point does have higher latency, but that's true of
any code that uses lots of multiplication. Float multiplication
also has the really nice property that you need a lot less
shifts and clamps than fixed point code, so for algos with tons
of multiplications (say, FFT for instance), floating-point
is really really useful.

If I need to do multiplications in integer code (for instance,
in a FM synthesizer - can't use float because converting
float to int to use as array indexes is slow), then what I
generally do is that I just make sure my * operands are
essentially 16-bits (or, say, one operand is 24-bits and the
other is 8-bits), and carefully use lots of >> bitshifts to keep
everything in range. For sound generation, this is generally
fine because the human ear isn't that 'precise' anyways
(in terms of bits).

Another special case was writing ARM NEON code, which had
lackluster floating-point and only a 16x16 multiplier in the hardware
(32x32 multiplies had a significant penalty). So I used a lot of a
particular opcode called VQDMULH.s16 - 16x16->16 vector
signed doubling multiply keeping the top part of the result and
clamping the special case of -32768* -32768 to 32767 - equivalent to this:

Code: Select all

int32_t res = (int16_t)a * (int16_t)b;
res = res >> 15;
if (res == 32768)
    res = 32767; // clamp the particular case of -32768 x -32768
out = (int16_t)res;
That being said, I have had to use 32x32->64 wide multiplication
in code 2 or 3 times. But I think it's a special case, compared to
the 32bit float multiplication, which I use all over the place!


Author: Agner Date: 2016-04-03 13:49
Hubert Lamontagne wrote
One catch is that vector registers need special alignment. For instance, if your vector regs are 512bits and your DCache width is 512bits, you want your vectors to be 512bit aligned when saved to the stack, so you need to 512-align your stack pointer.
The code has to be compatible with different processors with different vector sizes, so we don't want the required alignment to depend on the processor. A separate stack for vectors is a possible solution, but very wasteful. Cache space is a limiting resource so I don't want to save a possibly very long vector when only a small part of it is used. This is why I think it is smart to save the vector length in the vector register itself. A save instruction will save only as much as is needed. In the caller-save situation, the function knows how much of the register is saved, so it can use a normal store instruction. In the callee-save situation, the function will rely on the register length information and use the special save instruction to save only what is needed. This situation will be rare anyway, because the 16 caller-save registers will be enough for most purposes.

Separate registers for floating point scalars would be useful if we had to save the full vector in a callee-save situation, but the length information eliminates this need.

I think we can live with 8-byte alignment of vectors. The hardware can handle this with a simple barrel shifter, but it may have to load an extra cache line, of course. Large arrays should be aligned by the cache line size for optimum performance.

I think the format for saving vector registers with the special save instruction should be implementation dependent. It may use a single byte for the length if the length cannot exceed 128 bytes, or it may use more for longer vectors or for the sake of alignment. It may even compress the data if this can be done fast enough. For example, a boolean mask vector using only one bit of each 64-bit element can obviously be compressed a lot. The format will be padded to fit whatever alignment is optimal on the particular processor. The software should use data in the special "save format" for no other purpose than to restore the register on the same processor.

It is a disadvantage that the saved format may be longer than the maximum vector length when it includes the length information. But I think this is outweighed by the advantage that most saved registers will use less space. Many registers will be unused and store only a zero for the length.
There's an extra cost, it's just in a different non-obvious place: it forces the compiler to figure out if the carry bits are relevant for each operation in a chain, and if the compiler can't figure it out it will output less efficient. Whereas if carry flags are in their own register, and only written/read by some operations (on ARM, when SUB generates flags it is called SUBS and is a different instruction), then the compiler only ever has to worry about carry flags for instructions that expressly read/write them (ADDS / ADC / ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just becomes one extra register pool in the compiler's register allocator.
As I wrote, I have found an alternative solution for add with carry. We only have to consider whether we need an efficient way of tracking integer overflow.
CPUID could be replaced by a bunch of read-only system registers that give the CPU model for instance.
Good idea!

Joe Duarte wrote:
Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses?
My priority is the performance of big systems. That's why I have 64-bit address space. All addresses are relative to some pointer (instruction pointer, data section pointer, stack pointer, or an arbitrary pointer) with a signed offset of 8 bits or 32 bits. The instruction size will not be reduced by having a smaller address space, but of course we could save some stack space by having a 32-bit mode. I don't like having two different modes, though. Then we would have problems with stack alignment for doubles, etc. Byte code languages can have their own smaller address space of course.
String parsing also falls in this kind of case. For instance, checking the length of a string is not so obvious: often the interpreter knows the number of bytes of a string, but some of those bytes are UTF-8 characters, and if you account for special cases (like ending up with Latin-1 aka CP-1252 text), then there's often really no alternative to just looping through the string byte per byte
I think we should use UTF-8 only. It is possible to search for a terminating zero by loading a full vector and compare all bytes in the vector with zero. My ABI requires a little extra space at the end of user memory to avoid access violation when reading past the end of a string that happens to be placed at the very end of user memory. But of course it is more efficient to save the length of the string.
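A sketch of the idea with today's SSE2 intrinsics (ForwardCom would use its own length-agnostic vector compare; this version may read up to 15 bytes past the terminator, which is exactly the over-read the ABI padding is meant to make safe):

Code: Select all

#include <emmintrin.h>  // SSE2
#include <cstddef>

size_t vec_strlen(const char* s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)); // one bit per zero byte
        if (mask)
            return i + __builtin_ctz(mask); // position of the first zero byte (GCC/Clang builtin)
    }
}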
I write C++ sound code, and for code with lots of multiplications that could overflow, I use floating point code _all the time_. :3
Hardware multipliers are expensive, and divisors are even more expensive. I wonder if we need to support multiplication and division of all operand sizes, including vectors of 8-bit and 16-bit integers, if programmers are using floating point anyway?


Author: Joe Duarte Date: 2016-04-03 16:51
Joe Duarte wrote:
Since it's a program's exclusive virtual memory space,
a universe of our own, why can't we use arbitrary and
much, much smaller addresses?
Agner replied:
My priority is the performance of big systems. That's
why I have 64-bit address space. All addresses are
relative to some pointer (instruction pointer, data
section pointer, stack pointer, or an arbitrary
pointer) with a signed offset of 8 bits or 32 bits.
The instruction size will not be reduced by having a
smaller address space, but of course we could save
some stack space by having a 32-bit mode. I don't like
having two different modes, though. Then we would have
problems with stack alignment for doubles, etc. Byte
code languages can have their own smaller address
space of course.
I just don't like all the waste with 64-bit types and pointers. It becomes less of an issue over time, but it still bothers me. Relatedly, I like what these Berkeley people did with Precimonious, a tool that tunes floating point precision and eliminates some of the waste: www.eecs.berkeley.edu/~rubio/includes/sc13.pdf

Question: Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR? It seems like the number of registers is a fiction anyway (I was stunned to discover that x86-64 processors from Intel and AMD have nearly 200 physical registers.) This would make the number of registers implementation-dependent, rather than part of the ISA specification. See Vikram Adve's recent talk at Microsoft: http://research.microsoft.com/apps/vide ... ?id=249344 (The Microsoft Research people must have been having a bad day or something – their questions and comments reveal that they thoroughly misunderstood his ideas.)

Question 2: Let's assume that we have registers R0 - R31. Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one". I can imagine some scenarios where this might be useful for a compiler. And it seems to fit with the reality of register renaming anyway.


Author: Hubert Lamontagne Date: 2016-04-03 19:00
Suppose the SP is at 0x02018, and the L1 cache lines are 64 bytes in size, and you want to save a vector that's, say, 24 bytes long (6*32bit floats). Then, you need to first save the control word that tells you the size, compression etc of the vector. Fair enough, the vector data goes to 0x01FE8..0x02017. And then you have to save the vector size control word, which puts you at 0x01FE4 if you assume 32bits... but this doesn't work because then your SP is not 8-byte aligned anymore for 64bit integer and floating-point value. So you must save the vector size word to 0x01FE0 instead, with some extra padding (and the CPU either stores the amount of padding in the vector size word, or recalculates the amount of padding from SP alignment and vector size when reloading the vector).

Another possibility is that you could add some post-padding, so that the vector line is saved to 0x01FD8..0x01FFF and the control word goes to 0x01FD0, so that the whole thing fits in a single cache line. The amount of post-padding must be saved in the vector size control word.

Yeah, it's doable. But it's a long multicycle instruction, probably microcoded - after all, it writes an unpredictable amount of bytes to unpredictable offsets, often spanning 2 different cache lines, and updates the SP, and involves multiple address calculations to figure out just how much pre-padding and post-padding you need to do to keep your stack and your data well aligned. And it's very likely to completely block memory operation reordering (ie act like a memory barrier) because it's too difficult for concurrent memory operations to figure out whether they will overlap or not.

Agner wrote:
Hardware multipliers are expensive, and divisors are even more expensive. I wonder if we need to support multiplication and division of all operand sizes, including vectors of 8-bit and 16-bit integers, if programmers are using floating point anyway?
Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering (when OpenGL/DirectX are unavailable due to software constraints, such as running as a plugin). For scalars, 32*32->32 multiplies cover everything (and are common in C++ code), but some CPUs also provide 16*16->32 multiplies because they run faster (ARM).


Author: Agner Date: 2016-04-04 08:50
Joe Duarte wrote:
Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR?
What do you mean? If we have 1023 virtual registers like LLVM then we need 10 bits in the opcode for each register. Or do you want a rolling register stack? Then we have a problem when the register stack overflows. That would be quite wasteful if the overflow happens in the innermost loop.
Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one".
All of the registers behave like that in a superscalar processor. You ask for a particular architectural register and you get a random physical register - you don't even know which one you have got.

Hubert Lamontagne wrote:
then your SP is not 8-byte aligned anymore for 64bit integer and floating-point value. So you must save the vector size word to 0x01FE0 instead, with some extra padding (and the CPU either stores the amount of padding in the vector size word, or recalculates the amount of padding from SP alignment and vector size when reloading the vector).
Yes, this is a complication. I want to handle it in software without any complex instructions. If the size of the saved register image is not guaranteed to be a multiple of the stack word size then I would first calculate the amount of space needed for all the vector registers I want to save, then save the stack pointer to another register, then subtract the necessary size from the stack pointer, then align the stack by 8 ( AND SP,-8 ), then use a temporary register as pointer to the save area, and then save the registers, incrementing the pointer each time. The restore process is easier. There is no problem with saving registers during a task switch because you have a pre-allocated space that is big enough to cover all cases.

The alternative is to give the saved image of each register a size that is a multiple of the stack word size. This will make it easier to spill registers on the stack at the cost of using more space. It will make it simpler for the compiler and also easier for the hardware because of the better alignment. The cost is that it uses more cache space, which is probably more expensive than the extra instructions. If all vector registers are unused, then we will need only one or two bytes for saving each with the first method, versus eight bytes with the second method.

It is difficult to weigh the costs/benefits of these two solutions against each other, but you are right that the first method is very complicated.
Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering
x86 does not have a complete set of vector multiply instructions. NEON has 8, 16, and 32 bits. I don't see what you would need 8-bit vector multiply for?


Author: Hubert Lamontagne Date: 2016-04-04 21:01
Joe Duarte wrote:
Question: Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR? It seems like the number of registers is a fiction anyway (I was stunned to discover that x86-64 processors from Intel and AMD have nearly 200 physical registers.) This would make the number of registers implementation-dependent, rather than part of the ISA specification. See Vikram Adve's recent talk at Microsoft: http://research.microsoft.com/apps/vide ... ?id=249344 (The Microsoft Research people must have been having a bad day or something – their questions and comments reveal that they thoroughly misunderstood his ideas.)
This has a cost:
- You can make instructions variable-sized to accommodate different numbers of registers, but this increases branch mispredict penalty and makes it hard to run multiple instructions at the same time.
- What if your cpu has 32 registers and the program uses the 33rd? You can spill values to memory, but then the CPU has to figure out that it doesn't conflict with any other memory reads/writes.
- More registers = instructions become larger. MIPS instructions would take 42bits instead of 32bits if you had 1024 registers instead of 32.
- Larger register files are slower, which reduces clock rate or causes more stalls due to more latency cycles required to get register values.
Question 2: Let's assume that we have registers R0 - R31. Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one". I can imagine some scenarios where this might be useful for a compiler. And it seems to fit with the reality of register renaming anyway.
Okay, you can write a result to "Rx", but how do you find which register that "Rx" is once you want to read back your result and do something with it? What if you write to multiple "Rx"'es, how do you keep track of what went where?

------------------------------------

Agner wrote:
Yes, this is a complication. I want to handle it in software without any complex instructions. If the size of the saved register image is not guaranteed to be a multiple of the stack word size then I would first calculate the amount of space needed for all the vector registers I want to save, then save the stack pointer to another register, then subtract the necessary size from the stack pointer, then align the stack by 8 ( AND SP,-8 ), then use a temporary register as pointer to the save area, and then save the registers, incrementing the pointer each time. The restore process is easier.
Oh, I see. You'd use some kind of massive "vector store/load (including size prefix byte)" instruction that's basically never aligned to save all the vectors, then reestablish stack alignment. And for C++ ABI, you'd force all caller functions to save all SIMD vectors and floats and doubles, and use the caller's knowledge of what's in the registers to do a simpler aligned non-variable-sized save (instead of a massive unaligned variable-sized save)... On paper it works, but for some reason I find that rather scary.... :3

For instance, if you have a function working on a bunch of scalar floats, and then it calls some sub-function (say, "sqrt()" or something like that), won't it have to spill every single float register that it's working on onto the stack (potentially on every iteration of a loop)?
Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering
x86 does not have a complete set of vector multiply instructions. NEON has 8, 16, and 32 bits. I don't see what you would you need 8-bit vector multiply for?
I'm thinking of the case of 32bpp RGBA bilinear interpolation/texture mapping/rotozoom, along with alpha blending, for the cases where you can't use OpenGL (for instance in Macromedia Flash). That's a notorious CPU pipeline buster, because the algo eats up a ton of small multiplications.
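For a concrete idea of the workload, here is the scalar version of a per-channel alpha blend in C++; a SIMD version does several pixels at once and needs exactly these small widening multiplies:

Code: Select all

#include <cstdint>

// Per 8-bit channel: out = (src*a + dst*(255-a)) / 255 - the inner operation of
// alpha blending and bilinear filtering, which is all small multiplies.
uint32_t alpha_blend(uint32_t src, uint32_t dst, uint8_t a) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = (s * a + d * (255u - a) + 127) / 255; // two 8x8-bit multiplies per channel
        out |= c << shift;
    }
    return out;
}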


Author: Joe Duarte Date: 2016-04-06 19:51
Hubert said:
- 3-bytes can't be aligned to cache lines. If you have an array of pointers, eventually one of the pointers has to straddle 2 cache lines and that's a much larger penalty than some wasted DCache due to using 4-byte pointers.
- 3-bytes is only a 16 Mb address space, and few languages want to be limited to something that small. You cannot start with 3-byte pointer program and then dynamically upgrade everything to 4-bytes if you ever run out of space. Might as well make everything 4-byte from the start.
You're assuming byte-addressable memory. I'm assuming that these pointers or references would point to memory objects of arbitrary size, determined by what the variable, object, or function needs. I don't see why a program can't just tag its objects and entities in a virtual memory space with clean and compact pointers (but without garbage collection – just virtual memory.) I feel like there's not enough *virtual* in virtual memory right now – we should be able to abstract more.
Agner wrote:
Joe Duarte wrote:
Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR?
What do you mean? If we have 1023 virtual registers like LLVM then we need 10 bits in the opcode for each register. Or do you want a rolling register stack? Then we have a problem when the register stack overflows. That would be quite wasteful if the overflow happens in the innermost loop.
I mean register assignment could be sorted out at install time, or what's commonly called Ahead-of-Time compilation. One way to think of it is that some of what an LLVM back end does could be postponed until the application is installed and the precise characteristics of the CPU are known. If you look at the SPIR-V IR just released from Khronos, I think some of the optimization will happen at install time. And I think the Mill CPU architecture does this as well – the code is partly compiled into a "specification" until it's installed and knows which version of the CPU the user has.

Adve is talking about something similar in his talk. Also see Nuzman et al's "Vapor SIMD: Auto-Vectorize Once, Run Everywhere": https://www.irisa.fr/alf/downloads/roho ... _CGO11.pdf

On the issue of an Rx register, I think it might be a useful abstraction in some cases. You said that CPUs do this already, that we don't know what register we're getting. Yet, we're still naming registers explicitly, and you find it useful to retain named registers in an ISA. There are benefits to having named architectural registers, and I think there would be benefits from anonymous register semantics. Hubert asked how we'd refer back to it. There would be rules about how such register semantics could be used, and how to manage them – they wouldn't be the same as the normal registers. There are a few ways to go about it.

Author: Hubert Lamontagne Date: 2016-04-07 23:32
Joe Duarte wrote:
Hubert said:
- 3-bytes can't be aligned to cache lines. If you have an array of pointers, eventually one of the pointers has to straddle 2 cache lines and that's a much larger penalty than some wasted DCache due to using 4-byte pointers.
- 3-bytes is only a 16 Mb address space, and few languages want to be limited to something that small. You cannot start with 3-byte pointer program and then dynamically upgrade everything to 4-bytes if you ever run out of space. Might as well make everything 4-byte from the start.
You're assuming byte-addressable memory. I'm assuming that these pointers or references would point to memory objects of arbitrary size, determined by what the variable, object, or function needs. I don't see why a program can't just tag its objects and entities in a virtual memory space with clean and compact pointers (but without garbage collection – just virtual memory.) I feel like there's not enough *virtual* in virtual memory right now – we should be able to abstract more.
Okay, but how do you get the individual fields out of the object you have a reference to? You then need both the object's handle/tag/reference/id and an offset from the start of the object data (traditionally in bytes), or at least some kind of field ID (but that adds yet another translation pass to get the real byte offset).

The other issue is that you're probably going to hammer the TLB a lot harder that way: you're making the TLB hold all your real pointers instead of just page remaps. Which means you'll probably need a bigger, more complex TLB.

Third, this doesn't play well with C++ specifically, because objects aren't necessarily housed within their own memory allocation; they can be contained inside a larger object, or inside an array, or on the stack. So you need to save not only the handle, but also the byte offset. This is essentially the same thing as the infamous FAR pointer from 16-bit x86 coding.
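In C terms, such a reference ends up looking something like this (purely illustrative; neither the names nor the field widths come from the proposal):

Code: Select all

#include <stdint.h>

/* a "handle + offset" reference: the modern equivalent of a FAR pointer */
typedef struct {
    uint32_t handle;   /* which object or allocation the reference names */
    uint32_t offset;   /* byte offset of the field inside that object    */
} far_ref;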

Joe Duarte wrote:
https://www.irisa.fr/alf/downloads/roho ... _CGO11.pdf

On the issue of an Rx register, I think it might be a useful abstraction in some cases. You said that CPUs do this already, that we don't know what register we're getting. Yet, we're still naming registers explicitly, and you find it useful to retain named registers in an ISA. There are benefits to having named architectural registers, and I think there would be benefits from anonymous register semantics. Hubert asked how we'd refer back to it. There would be rules about how such register semantics could be used, and how to manage them – they wouldn't be the same as the normal registers. There are a few ways to go about it.
Then you've either got a register stack like the 8087 FPU or some ultra-low-power CPUs (GreenArrays chips, designed to run Forth), or a queue like in the Mill (which it calls the "belt") or the Itanium rotating register file, depending on whether your system forgets the newest values after use or the oldest ones.


Author: HarryDev Date: 2016-04-08 05:46
Hi, related to the JIT discussion I think the proposed ISA needs specific instructions for memory copying, initialization and comparison. Memory copying, from a small number of bytes to a large one, is especially important in GC-related scenarios. Often memory copy ends up being implemented as multiple complicated vector instructions, when an ISA instruction would be so much better (the CPU could then handle this in the most optimized way for it). Some research has been done on this where adding "cpblk" and "initblk" instructions was evaluated, and these showed great benefit for code size and speed. I would allow these instructions to have a defined element size, i.e. 1, 2, 4, 8, or 16 bytes, so one can initialize a "double" array quickly, and perhaps even a zero-out "zeroblk" instruction since zeroing is used a lot in managed-memory scenarios.


Author: Hubert Lamontagne Date: 2016-04-09 00:06
HarryDev wrote:
Hi, related to the JIT discussion I think the proposed ISA needs specific instructions for memory copying, initialization and comparison. Memory copying, from a small number of bytes to a large one, is especially important in GC-related scenarios. Often memory copy ends up being implemented as multiple complicated vector instructions, when an ISA instruction would be so much better (the CPU could then handle this in the most optimized way for it). Some research has been done on this where adding "cpblk" and "initblk" instructions was evaluated, and these showed great benefit for code size and speed. I would allow these instructions to have a defined element size, i.e. 1, 2, 4, 8, or 16 bytes, so one can initialize a "double" array quickly, and perhaps even a zero-out "zeroblk" instruction since zeroing is used a lot in managed-memory scenarios.
That's why x86 has the 'string copy' etc. instructions. And the speed gain depends on the CPU generation:

- On 8086 / 80286 / 80386, loading instructions was slow (no instruction cache!), so the string instructions were indeed the fastest. Other CPUs of the time also had stuff like this: the Z80 has string copies too, as does the 65816, and the 68000 has move multiple.

- On the 486 and Pentium, designers realized that non-microcoded instructions can be implemented to run a lot faster, but there was only so much space on the chips, so only a few instructions could get the speedup; string instructions were left out and actually ran slower... RISC CPUs don't have any string copy instructions either - it just doesn't fit in their design!

- On later out-of-order x86 CPUs, string copies are still microcoded so they're slow to start, but once started they do wide multi-byte copies. So the end result is similar to the multiple complicated vector instructions anyway (since CPUs still have to deal with alignment). On the latest Ivy Bridge and Haswell, string copy instructions are now the same speed as software loops - but that's a new thing: on preceding architectures, string copy instructions were still slower... RISC architectures like ARM mostly went with vector instructions to fill this need.

So I'm not sure that string copy/zero/compare instructions are still relevant nowadays because memory bandwidth is not all that fast, and the kind of pointer updating that was costly on 8 and 16 bit CPUs is now free since it runs in ALUs in parallel while the data cache does its job. And having CPU string copy instructions doesn't remove the need for complex cacheline-aligned loops - it's really just moving that complexity from the software to the microcode. I think it would be better to look into garbage collection algorithms that don't do compaction and don't recopy everything (and an end to Java's "allocate humongous blocks of memory on startup" approach).

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

How about stack machine ISA?

Post by agner » Thu Nov 02, 2017 9:43 am

Author: A-11 Date: 2016-04-10 07:35
It is probably too late,
but since I found a piece of research today, I will ask about it.

BOOST: Berkeley's Out-of-Order Stack Thingy
https://www.researchgate.net/publicatio ... ack_Thingy

Same as "register renaming" for register machine, these researcher propose "address renaming" for stack machine to implement out-of-order execution.
So I think there are no reasons to prevent stack machine for performance problem.
Since stack machine instruction sets never bother us with number of registers, I think stack machine's ISA is more extensible than register machine's.
Also, stack ISA is easer for compilers as UCSD p-System showed.
it should also hold true for Just In-time Compiling.


treating stack ISA as CISC architecture
Author: A-11 Date: 2016-04-14 02:55
The BOOST architecture above converts each stack instruction into one corresponding RISC-like micro-op.
Since a stack ISA is even more fine-grained than a load-store RISC ISA, which at least puts the opcode field and operand fields together in one instruction, BOOST generates more RISC-like micro-ops and requires an even more massive renaming unit than a RISC architecture does.
I think this is the major difficulty of BOOST, just as RISC requires more instruction throughput than CISC.
Since each reservation station entry in a conventional CISC machine handles a "macro-op" which represents a read-modify-write sequence and generates several RISC-like micro-ops, CISC machines beat RISC machines here.
I think the source of CISC's power is such a powerful RS entry, which approaches being a tiny processor of its own, much as in EDGE.
Explicit_Data_Graph_Execution - Wikipedia
http://en.wikipedia.org/wiki/Explicit_D ... _Execution
This means that more powerful macro-ops put less pressure on the renaming unit and thus extract more performance.
For example, a read-modify-read-modify-read-write macro-op such as the line below is preferable to plain read-modify-write ops.
read [sp + 2] ; read [sp + 4] ; mul ; read [sp + 5] ; add ; load-from-heap ; write [sp + 2] ;
But you can see that this one macro-op is equivalent to a sequence of 7 instructions written in a stack ISA.
Thus, I think a stack ISA can be treated as a CISC architecture with an instruction border marking unit.
An algorithm like parenthesis matching could be used for the border marking.
Note that the instruction border is determined by the machine implementation, not by the ISA as in x86.
If a machine can't deal with big macro-ops like the one above, its border marking unit can split the stack ISA sequence to fit its macro-ops.
read [sp + 2] ; read [sp + 4] ; mul ; //read-read-modify-(write to stack)
read [sp + 5] ; add ; //(read from stack)-read-modify-(write to stack)
load-from-heap ; write [sp + 2] ; //(read from stack)-read-write
Or, since a simple stack machine (single-issue, in-order) is so tiny that it may fit in an RS entry, you might be able to treat a whole block delimited by jump instructions as one macro-op.


treating stack ISA as CISC architecture
Author: Agner Date: 2016-04-14 09:03

Thank you for the reference to Explicit Data Graph Execution (EDGE) https://en.wikipedia.org/wiki/Explicit_ ... _execution
If the EDGE principle can be implemented smoothly, it might be more efficient than splitting a job into multiple threads running in multiple cores. The thread management, synchronization, and communication between threads are very complicated and inefficient in current systems. An alternative would certainly be worth exploring.

However, I don't quite understand how the Hyperblocks are coded and delimited.


Author: A-11 Date: 2016-04-17 01:51
The recipe below is what I have arrived at and can present now.

-- begin recipe --
(0)
Prepare an empty stack, the "Parser stack".
Each item of the stack holds an immediate value or a load from the architectural stack, which must be renamed to the register file.
But the bottom item of this stack may hold a Hyperblock.

(1-arg)
If the next instruction is an immediate or a load from the architectural stack, push it onto the Parser stack.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ;
next instruction : [sp+5]
Parser stack after : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ;

(1-op)
If the next instruction is an N-in-1-out (inc, add, fma3, etc.) or N-in-0-out ("store [sp+2] top-of-stack", etc.) micro-op:
[1] Pop N items from the Parser stack and concatenate them with the next instruction into a Hyperblock.
[2] If items remain in the Parser stack, output each of them as an individual Hyperblock.

[3-1out] If the next instruction is a 1-out op, leave the Hyperblock made at [1] on the bottom of the Parser stack.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ; $7
next instruction : add(2-in-1-out op)
outputs :
{ [sp+3] ; $4 ; add } ;
$5 ;
Parser stack after : {[sp+5] ; $7 ; add } ;

[3-0out] If the next instruction is a 0-out op, output the Hyperblock from [1]; the Parser stack is then empty.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ; $7
next instruction : store [sp+2] top-of-stack (1-in-0-out op)
outputs :
{ [sp+3] ; $4 ; add } ;
$5 ;
[sp+5] ;
$7 ; store [sp+2] top-of-stack ;
Parser stack after : (empty)

(1-other)
For other cases (a 2-out op like "divmod", reaching the maximum Hyperblock length, a control flow instruction, and so on),
output each item in the Parser stack as an individual Hyperblock, so the Parser stack becomes empty.

(2) To iterate, go back to one of the (1-*) rules depending on the next instruction.
-- end recipe --
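For concreteness, here is a minimal C sketch of the recipe above (Hyperblocks and stack items are represented as plain strings, and the classification of each instruction as leaf, N-in-1-out or N-in-0-out is assumed to be known in advance; this only restates the recipe, it is not taken from the BOOST paper):

Code: Select all

#include <stdio.h>
#include <string.h>

#define MAX_ITEMS 64
#define MAX_TEXT  512

static char parser_stack[MAX_ITEMS][MAX_TEXT];
static int  depth;

static void emit(const char *hb) { printf("Hyperblock: %s\n", hb); }

/* rule (1-arg): an immediate or a load from the architectural stack */
static void push_leaf(const char *ins) { strcpy(parser_stack[depth++], ins); }

/* rule (1-op): 'ins' consumes n_in items and produces n_out (0 or 1) results */
static void op(const char *ins, int n_in, int n_out)
{
    char hb[MAX_TEXT] = "";
    int first = depth - n_in;
    for (int i = first; i < depth; i++) {      /* [1] pop N items and ...     */
        strcat(hb, parser_stack[i]);
        strcat(hb, " ; ");
    }
    strcat(hb, ins);                           /* ... append the op itself    */
    depth = first;
    for (int i = 0; i < depth; i++)            /* [2] flush leftover items    */
        emit(parser_stack[i]);
    depth = 0;
    if (n_out == 1)
        strcpy(parser_stack[depth++], hb);     /* [3-1out] keep at the bottom */
    else
        emit(hb);                              /* [3-0out] output it now      */
}

/* rule (1-other): 2-output op, control flow, maximum Hyperblock length, ... */
static void barrier(void)
{
    for (int i = 0; i < depth; i++) emit(parser_stack[i]);
    depth = 0;
}

int main(void)
{
    /* replays the example from the recipe */
    push_leaf("[sp+3]"); push_leaf("$4"); op("add", 2, 1);
    push_leaf("$5"); push_leaf("[sp+5]"); push_leaf("$7");
    op("add", 2, 1);   /* emits "[sp+3] ; $4 ; add" and "$5",
                          leaves "[sp+5] ; $7 ; add" on the stack */
    op("store [sp+2] top-of-stack", 1, 0);
    barrier();
    return 0;
}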

Though I mentioned before that an RS entry approaches being an individual processor, it must be tiny, and so it lacks a lot of units.
In particular, each RS entry in a current conventional CISC processor lacks the ability to do flow control, "mov" elimination by renaming, or out-of-order execution.
So a Hyperblock should not contain a "mov" instruction or concurrent flows like the ones below.
($4 $5 add) ($2 $3 shift-left) // these two computations have no dependency, so the concurrency would have to be detected.
swap-top-2-item div // "swap-top-2-item" should be eliminated with renaming.
This is the reason the Parser stack must become empty every time it meets a micro-op.
The Parser stack is a parse-time emulation of the stack inside an RS entry.
You may point out that
$4 ($2 $3 shift-left) add
has no concurrent flow.
I ignore such cases to keep the recipe simple enough to implement easily in hardware.

For ops which have 2 or more outputs, I have not yet found an easy way to treat them.


Author: Hubert Lamontagne Date: 2016-04-17 14:35
Agner wrote:
Thank you for the reference to Explicit Data Graph Execution (EDGE) https://en.wikipedia.org/wiki/Explicit_ ... _execution
If the EDGE principle can be implemented smoothly, it might be more efficient than splitting a job into multiple threads running in multiple cores. The thread management, synchronization, and communication between threads are very complicated and inefficient in current systems. An alternative would certainly be worth exploring.

However, I don't quite understand how the Hyperblocks are coded and delimited.
I do have a design in store that does what EDGE does, and should be targetable from C++ compilers, but it's kinda weird:

- Registers are the accumulator (ac), and a rotating register file with 4 partitions of 16 registers (a0-a15, b0-b15, c0-c15, d0-d15). The ALLOC instruction shifts down register file names, so for instance ALLOC d10..d15 will move down the previous contents of d15 to d9, the content of d14 to d8, d13->d7, d12->d6, d11->d5, d10->d4, d9->d3, d8->d2, d7->d1, d6->d0, and the contents of d0..d5 are lost. The new values in registers d10..d15 are marked as "uninitialized" and instructions that try to read them will stall until the registers are written to. ALLOC is always in the form of aN..b15, bN..b15, cN..c15, dN..d15 and can allocate from multiple partitions at the same time.

- Each non-accumulator register can only be written to once (!). This means that it's basically a form of hardware SSA. This is why there are multiple partitions: values have multiple classes of life duration: extremely short (accumulator), very short (would probably use d0..d15, typically used for multi-use temporary values and merging 2 branches of calculation), loop counters (c0..c15, get rewritten every loop), and then various long lived values like loop constants and stack pointers and the like (aN and bN).

- Instructions come in groups. Each group is a sequence of serial instructions, and operates on the accumulator: the first operation of the group must not have any dependency on the previous state of the accumulator, but then subsequent instructions modify the accumulator. Every instruction always writes to the accumulator, plus optionally also has a store to a register from the rotating register file. Example of group:
mul ac, d14, b15 ;no dependency on the previous state of the accumulator, start of group
sar ac, 16
add ac, d13, st d12 ;result goes in both accumulator and d12
add ac, d15, st d11 ;result goes to both accumulator and d11, group ends

- Memory loads/stores must use [register+immediate] as addressing mode (it's not possible to use the accumulator as part of address calculations), and are separately issued in-order, independent of ALU instruction grouping.

- The ALLOC instruction is also not part of ALU instruction grouping.

- The reason for this weird design is that it makes register renaming very easy: every time an ALLOC instruction comes up, you simply increase the rotation index for the target register partitions and clear the 'value ready' bit for the new register names (and also check that the physical registers you're using are free). Once that is done, every single following ALU operation group is _guaranteed_ to be parallelizable (aside from the memory loads/stores) because every register name is unique!

- Each new ALU instruction group is queued to a free ALU, and each ALU runs one instruction per cycle from its group. If the instruction's rotating register file operand doesn't have its 'value ready' bit on, then this instruction group stalls until the value becomes ready (in other words, it waits until the value gets stored by an ALU or memory load operation).

- Register file renaming can be done late by the ALUs, since the rotating register file index of each partition is recorded when the group starts. This also means that every ALU can practically have its own ICACHE - they can be _scheduled_ 100% independently from other instructions!

- Multiple ALU groups can be scheduled per cycle, in any order(!). The only thing that has to be done in order is register renaming using the ALLOC instruction and memory loads and stores.

- For the C++ compiler, working from SSA, it has to tag all the operations that are part of a series of operations, to form groups. This is not that hard as far as I can tell: every time a value is only used by the next operation in the sequence, then it can be passed using the accumulator instead. Generally, all single-destination values can be assigned to the accumulator. Since the compiler knows how long each value lives, it can make sure register file renames using ALLOC never wipe out values that are still in use (although it does have to figure out the lifespan of each value).


This design still needs lots of refinement (especially in the area of loads/stores, calling convention...) and is more complex than I'd like (hence the 'this is weird' warning), and probably also fairly out-there, and potentially doesn't gain too much over the traditional out-of-order RISC (if the whole thing is limited by memory performance in particular), and reminds me of the Itanium at times, but it does have the advantage of being relatively implementable and compilable, and in theory there's no limit to the number of ALUs you can run in parallel (I can easily see designs with 4, 8, even more ALUs).


stack ISA versus long vectors
Author: Agner Date: 2016-04-18 00:37
Hubert Lamontagne wrote:
I do have a design in store that does what EDGE does, and should be targetable from C++ compilers, but it's kinda weird:

- Registers are the accumulator (ac), and a rotating register file with 4 partitions of 16 registers (a0-a15, b0-b15, c0-c15, d0-d15). The ALLOC instruction shifts down register file names, so for instance ALLOC d10..d15 will move down the previous contents of d15 to d9, the content of d14 to d8, d13->d7, d12->d6, d11->d5, d10->d4, d9->d3, d8->d2, d7->d1, d6->d0, and the contents of d0..d5 are lost. The new values in registers d10..d15 are marked as "uninitialized" and instructions that try to read them will stall until the registers are written to.
Thank you for explaining your idea. It might be a problem that you have only one accumulator.

The best candidate for an independent instruction block is a loop iteration with no loop-carried dependency. I think it would be easier for the compiler in this case to just use a very long vector register with variable length to cover multiple iterations of the loop at once.

The main problem with very long vectors is instructions that move data horizontally across a vector. The latency of horizontal data moves may increase with the vector length. I have an idea to mitigate this problem a little. All instructions that involve horizontal data movement across a vector have information about the distance of the move (e.g. index or shift count) in a separate register or an immediate constant. The scheduler wants to know the latency of the instruction as early as possible. It will be allowed to read the "distance register" at an early stage in the pipeline before the other operands are ready. This value will typically be available early anyway thanks to out-of-order execution. There is probably no way to avoid the data transfer delay, but horizontal moves will be rare anyway. Current designs are already reading registers used in address calculation earlier in the pipeline than operand registers. I want to use a similar mechanism for predicting instruction latency.

It will also be an advantage to know the vector length early so that it can clock gate or power down unused parts of the buses and ALUs to save power.


Author: Hubert Lamontagne Date: 2016-04-19 23:16
Agner wrote:
Thank you for explaining your idea. It might be a problem that you have only one accumulator.
It's on purpose! If you need a second accumulator, that means that your calculation has two branches to it (either from a split or a join), which means it has some parallelism to it, which means you want it to run on more than one ALU to exploit that parallelism, which means you need some sort of register to send values between the ALUs. That register for sending values between ALUs can be a second accumulator, but then that's a lot harder to rename, because the scheduler has to go through the whole stream of instructions and give a whole bunch of different names. What I'm suggesting is that this crazy renaming should be done beforehand, and that's what the whole rotating-name single-write register file is there for: it gives different names to all the renamed versions of the second accumulator beforehand, so that the scheduler has to do renames much less often, and it can do a whole bunch of renames at the same time.

Agner wrote:
The best candidate for an independent instruction block is a loop iteration with no loop-carried dependency. I think it would be easier for the compiler in this case to just use a very long vector register with variable length to cover multiple iterations of the loop at once.
The main problem with very long vectors is instructions that move data horizontally across a vector. The latency of horizontal data moves may increase with the vector length. I have an idea to mitigate this problem a little. All instructions that involve horizontal data movement across a vector have information about the distance of the move (e.g. index or shift count) in a separate register or an immediate constant. The scheduler wants to know the latency of the instruction as early as possible. It will be allowed to read the "distance register" at an early stage in the pipeline before the other operands are ready. This value will typically be available early anyway thanks to out-of-order execution. There is probably no way to avoid the data transfer delay, but horizontal moves will be rare anyway. Current designs are already reading registers used in address calculation earlier in the pipeline than operand registers. I want to use a similar mechanism for predicting instruction latency.
It will also be an advantage to know the vector length early so that it can clock gate or power down unused parts of the buses and ALUs to save power.
Hmm, that's not too hard to do if the vector lengths are in the normal register file, and the SIMD unit is running late compared to the scalar integer unit. Maybe you could have multiple swizzles with different granularities and delay times - like a 1 byte granularity scramble, and then maybe a 4 byte scramble, 16 byte scramble, a 64 byte scramble... Maybe the cpu could look in the swizzle data and figure out how many 0 bits there are and use the coarsest correct granularity.


Author: Agner Date: 2016-04-20 00:08
Hubert Lamontagne wrote:
If you need a second accumulator, that means that your calculation has two branches to it (either from a split or a join), which means it has some parallelism to it
OK, I see.
that's not too hard to do if the vector lengths are in the normal register file, and the SIMD unit is running late compared to the scalar integer unit. Maybe you could have multiple swizzles with different granularities and delay times - like a 1 byte granularity scramble, and then maybe a 4 byte scramble, 16 byte scramble, a 64 byte scramble... Maybe the cpu could look in the swizzle data and figure out how many 0 bits there are and use the coarsest correct granularity.
That's indeed very close to what I have in mind. A permute instruction will have the granularity given by the operand type, one vector output, two vector inputs for input data and indexes, and then a "block size" given by an integer register. For example, if the block size is 16 bytes then data can only be moved within each 16-byte block of the vector. There has to be a maximum block size anyway because the complexity of the circuit grows with the square of the block size and the latency also grows with the block size. The maximum block size may be implementation dependent, and it may be smaller for low granularities.
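As a reference model of those semantics, a plain-C sketch (the 32-bit granularity and the wrap-around of out-of-block indexes are my own assumptions, not part of the specification):

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* permute 32-bit elements: each output element may only pick a source
   element from within its own block of block_size_bytes */
void permute_blocked_u32(uint32_t *result, const uint32_t *data,
                         const uint32_t *index, size_t vector_len_elems,
                         size_t block_size_bytes)
{
    size_t block_elems = block_size_bytes / sizeof(uint32_t);
    for (size_t i = 0; i < vector_len_elems; i++) {
        size_t block_start = (i / block_elems) * block_elems;
        size_t src = block_start + (index[i] % block_elems);
        result[i] = data[src];
    }
}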


Author: A-11 Date: 2016-04-18 08:09
Well... I forgot to explain why the bottom item of the Parser stack can hold a Hyperblock while the other items can't.
Forgive me for adding a bit of explanation below.

-- begin --
(1-op)
If the next instruction is an N-in-1-out (inc, add, fma3, etc.) or N-in-0-out ("store [sp+2] top-of-stack", etc.) micro-op:
[1] Pop N items from the Parser stack and concatenate them with the next instruction into a Hyperblock.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5
next instruction : add(2-in-1-out op)
Hyperblock : { [sp+3] ; $4 ; add ; $5 ; add }
Parser stack after [1] : (empty)
[2] If items remain in the Parser stack, output each of them as an individual Hyperblock.
[3-1out] If the next instruction is a 1-out op, leave the Hyperblock made at [1] on the bottom of the Parser stack.
ex.
Parser stack after [2] : { [sp+3] ; $4 ; add ; $5 ; add }
-- end --

Thanks to Reverse Polish Notation, you are free to replace a constant value with an instruction sequence that produces a value.
ex.
$4 ; $3 ; add
can be replaced into
($2 ; $2 ; add) ; ($9 ; $3 ; div) ; add
and more
(($5 ; $3 ; $-13 ; fma3) ; ($1 ; $1 ; add) ; add) ; (($5 ; $4 ; add) ; (return-three-func-addr ; call-from-addr) ; div) ; add
and so on.
But since each instruction sequence has no dependency on the others, replacing more than one value gives a Hyperblock more than one concurrent flow.
I want to avoid concurrent flows within a Hyperblock, so the recipe allows only one item in the Parser stack to hold a Hyperblock.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for an ideal extensible instruction set

Post by agner » Thu Nov 02, 2017 9:48 am

Author: zboson Date: 2016-04-11 03:24
Hi,

I think it might be worth considering other three-operand, single-rounding instructions besides FMA.

As I discovered here
http://stackoverflow.com/questions/3057 ... ubledouble
double-double multiplication can be done with

high = a * b;
low = fma(a, b, -high);

However double-double addition cannot be done so simply. If there was a three operand single rounding mode (a+b+c) instruction then double-double addition would be much faster (actually (a+b-c) is perhaps the most interesting). I realize this may seem to be a very special case. But double-double is a kind of poor man's quad precision floating point. So in that sense it's maybe not such a special case for those interested in quad precision floating point. The only thing holding double-double back (if an instruction set already has FMA) is a single-rounding (a+b-c), and this would be a lot simpler to implement than quad precision floating point support, I think.


Author: Agner Date: 2016-04-11 08:04
HarryDev wrote:
I think the proposed ISA needs specific instructions for memory copying, initialization and comparison.
My experience with x86 processors is that these microcoded instructions have very long setup times and that software loops with vector registers are faster in most cases. I want to avoid complex microcode instructions. Instead, I have designed instructions for very fast software loops using variable-length vector registers.

A-11 wrote:
BOOST: Berkeley's Out-of-Order Stack Thingy
Thanks for the interesting reference. As I understand the article, it is more difficult to do out-of-order processing on a stack machine than on a conventional machine.

zboson wrote:

double-double multiplication can be done with

high = a * b;
low = fma(a, b, -high);

However double-double addition cannot be done so simply. If there was a three operand single rounding mode (a+b+c) instruction then double-double addition would be much faster (actually (a+b-c) is perhaps the most interesting).

Interesting idea. Actually, I once used that fma trick to get extra precision in a multiplication. So as I understand your idea, you want to do:

Code: Select all

high = a + b
low = a + b - high // (with extra precision in the intermediate)
I think this would be quite costly to implement in hardware, considering the rare use. Instead, you can do this:

Code: Select all

high = a + b;
if (abs(a) > abs(b)) {
   low = b - (high - a);
} else {
   low = a - (high - b);
}
This does not require any extra hardware.
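For reference, the branch can also be avoided with the classic six-operation two-sum sequence (Knuth/Møller), at the cost of a few extra additions; this is standard double-double material and needs no extra hardware either:

Code: Select all

/* exact sum: returns high = round(a + b) and sets *low so that
   high + *low == a + b exactly, with no comparison of magnitudes */
double two_sum(double a, double b, double *low)
{
    double high = a + b;
    double bb   = high - a;
    *low = (a - (high - bb)) + (b - bb);
    return high;
}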


Author: Hubert Lamontagne Date: 2016-04-11 21:03
There is an algorithm for doubling the precision of floating point addition:

Kahan summation https://en.wikipedia.org/wiki/Kahan_summation_algorithm
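In its textbook form (nothing ISA-specific here):

Code: Select all

#include <stddef.h>

/* Kahan compensated summation: c carries the low-order bits lost by sum */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   /* what was lost when adding y to sum */
        sum = t;
    }
    return sum;
}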

A-11 wrote:
It is probably too late,
but since I found a piece of research today, I will ask about it.

BOOST: Berkeley's Out-of-Order Stack Thingy
https://www.researchgate.net/publicatio ... ack_Thingy

Same as "register renaming" for register machine, these researcher propose "address renaming" for stack machine to implement out-of-order execution.
So I think there are no reasons to prevent stack machine for performance problem.
Since stack machine instruction sets never bother us with number of registers, I think stack machine's ISA is more extensible than register machine's.
Also, stack ISA is easer for compilers as UCSD p-System showed.
it should also hold true for Just In-time Compiling.
That's an interesting architecture. Though I'm not sure I like how it's basically doing 2 or 3 memory accesses per cycle and then optimizing these accesses away using aggressive renaming, and that the stack and function-local memory accesses seem to not be totally consistent with the "real" memory loads/stores through pointers. (Otherwise it would have to stall every time, no?)

Come to think of it, are there other "not totally consistent" memory models that are useful, performance wise? (aside from the standard multi-core - with - memory-barriers - and-so-forth stuff)

The one thing I do like about stack based architectures is how they reduce the number of register file read/writes, and intuitively I'd be more interested in a model that (1) also has registers (ie it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known) (2) has a limited stack depth - something like 16 values with no automatic spilling (ie just enough for calculations, function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills) (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).


Author: Agner Date: 2016-04-12 00:12
Hubert Lamontagne wrote:
The one thing I do like about stack based architectures is how they reduce the number of register file read/writes, and intuitively I'd be more interested in a model that (1) also has registers (ie it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known) (2) has a limited stack depth - something like 16 values with no automatic spilling (ie just enough for calculations, function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills) (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).
That sounds useful, but isn't this the same as x87? The main problem with x87 is that you have to do a lot of register swapping to get each operand to the top of the stack when you need it. Each value gets swapped to different positions several times through its life time so that it gets difficult to track where each value is. I haven't seen any compiler that can handle register variables nicely and keep them on the stack. If you want to avoid all that swapping then you need to access registers that are not on the top of the stack, and then the stack idea sort-of disappears. Is there a better way of doing this?


Author: Hubert Lamontagne Date: 2016-04-12 23:59
Agner wrote:
Hubert Lamontagne wrote:
The one thing I do like about stack based architectures is how they reduce the number of register file read/writes, and intuitively I'd be more interested in a model that (1) also has registers (ie it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known) (2) has a limited stack depth - something like 16 values with no automatic spilling (ie just enough for calculations, function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills) (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).
That sounds useful, but isn't this the same as x87? The main problem with x87 is that you have to do a lot of register swapping to get each operand to the top of the stack when you need it. Each value gets swapped to different positions several times through its life time so that it gets difficult to track where each value is. I haven't seen any compiler that can handle register variables nicely and keep them on the stack. If you want to avoid all that swapping then you need to access registers that are not on the top of the stack, and then the stack idea sort-of disappears. Is there a better way of doing this?
Not quite... x87 lacks non-stack registers and isn't designed to help the CPU figure out which sections of the calculation are parallelizable. If you start doing register swapping (like the whole fxch+op thing) then it just becomes a sort-of register file. Here, the CPU starts a new parallelizable section every time the stack size falls to zero. Each parallelizable section has its own stack, and it can be shown that the sections can't interfere (register file accesses still have to be renamed and memory conflicts still have to be solved though, and branch prediction failure can invalidate speculative sections). A similar architecture can be designed using an accumulator plus a register file.

Example code (linear interpolation of 16bit data):

Code: Select all

loop:
push r9
shr 16
push short [r8 + stack*2]
pop r0 ;renames r0

;this section can execute in parallel (once registers have been renamed)
push r9
shr 16
add 1
push short [r8 + stack*2]
sub r0 ;waits after renamed r0
push r9
and 0xffff
mul stack
sar 16
add r0
pop short [r7]

;this section can execute in parallel
push r9
add r10
pop r9 ;renames r9

;this section can execute in parallel
push r7
add 2
dup
pop r7 ;renames r7
cmp r11
branch if lower to loop

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Version 1.01

Post by agner » Thu Nov 02, 2017 9:50 am

Author: Agner Date: 2016-05-10 10:28
The instruction set specification is now updated to version 1.01: www.agner.org/optimize/instructionset.pdf

The instruction set has got the name CRISC1 to indicate the compromise between RISC and CISC.

All user-level instructions are now defined.

The most important change is that the length of a vector is now saved in the vector register itself. This has many advantages. When you need to save a vector register in a function or at a task switch, you only have to save the part of the register that is actually used. If the register is unused then you only have to save a zero for the length. Please see the document for further discussion of the advantages.


Author: Hubert Lamontagne Date: 2016-05-13 16:05
Nice. Do you have any plans for writing an assembler or a simulator?
(or other more involved stuff like a verilog/vhdl implementation or a c compiler... though that's a lot of work!)


Author: Agner Date: 2016-05-14 05:23
Hubert Lamontagne wrote:
Do you have any plans for writing an assembler or a simulator?
(or other more involved stuff like a verilog/vhdl implementation or a c compiler... though that's a lot of work!)
That is the plan. Right now I am working on the ELF object file format for this architecture.

But I am very busy with other projects too, so progress will be slow as long as this is a one man project.

I have an idea for how to implement system calls. It is a system call instruction which takes as input a 64-bit ID number. The upper 32 bits is a module ID which identifies a system module or device driver (The system core has ID = 0). The lower 32 bits identify a function within this module. System add-on modules and device drivers do not necessarily have fixed ID numbers because this would require some central authority to assign these ID numbers. Instead, the program will have to ask for the ID number by giving the name of the module. The functions within a module can have fixed or variable ID numbers.

There will be a system function (with a fixed ID number) which takes the names of the module and function as input and returns the ID number. The ID number can be retrieved in this way before the first call to the function.

The ID number can be put into the program in three ways:
  • The most important system functions have fixed ID numbers which can be inserted at compile time.
  • The ID number can be found at load time in the same way as relocation works. The loader will find the ID number and insert it in the code before running the program.
  • The ID number is found at run time before the first call to the desired function.
The system call instruction can get the necessary ID number either in a 64-bit register operand or in an immediate constant with 32 bits for the module ID and 16 bits for the function ID (assuming that the module has no more than 65000 functions).

The calling convention for system functions is the same as for other functions, using registers for parameters and for return value. Any register not used for a parameter to the system function can be used for the ID of the function.

The system call instruction will change the instruction pointer and stack pointer and proceed to system code. The old values of instruction pointer and stack pointer are saved in special registers, to be restored by a system return instruction.
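To illustrate the intended flow from the caller's side, here is a hypothetical C sketch; none of these names exist in the specification, they only stand in for the fixed-ID name-lookup function and the system call instruction:

Code: Select all

#include <stdint.h>

/* stand-in for the fixed-ID core function that resolves module/function names */
static uint64_t sys_get_function_id(const char *module, const char *function)
{
    (void)module; (void)function;
    return 0x0000002A00000007u;   /* module ID in the upper 32 bits, function ID below */
}

/* stand-in for the system call instruction taking the 64-bit ID in a register */
static int64_t sys_call_id(uint64_t id, int64_t arg0, int64_t arg1)
{
    (void)id; (void)arg0; (void)arg1;
    return 0;
}

/* the third option above: resolve the ID at run time, before the first call */
int64_t write_to_device(int64_t handle, int64_t value)
{
    static uint64_t id;
    if (id == 0)
        id = sys_get_function_id("mydriver", "write_register");
    return sys_call_id(id, handle, value);   /* parameters passed as in a normal call */
}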


Author: Harry Date: 2016-06-02 02:13
progress will be slow as long as this is a one man project.
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead? This would no doubt make collaboration easier, allow for pull requests, etc. And allow for topic focused issue discussions so the many different aspects of CRISC can be discussed in different threads.

Practically, I would change the CRISC document to a set of GitHub flavored markdown files. These are easily editable and work well with git version control. In time code, simulators etc. can be added to this project as well.

I believe this would make this project more available to a wider audience and increase collaboration. You could set up an organization for this on GitHub and also use several repositories if that would be better.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Public repository

Post by agner » Thu Nov 02, 2017 9:52 am

Author: Agner Date: 2016-06-02 05:16
Harry wrote:
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead?
You are right. This project has developed into more than I initially expected and it is approaching a level where it makes sense to move it to a public repository. What are the pros and cons of the different public repositories, such as GitHub, SourceForge, Savannah, etc.?

Version 1.02 of the CRISC1 document is underway with more detailed specifications of system calls, memory management, object file format, etc. This will no longer fit into a single file.


Author: Harry Date: 2016-06-02 13:54
Not that I have a great amount of experience with other public repositories, but GitHub really has a very good story with regard to issues and the discussions related to them. It is also easy to do pull requests and discussions around these, which encourages collaboration. Additionally, Markdown is supported by default and rendered on the web page. It also supports tables, which not all markdown flavors do. In addition, the GitHub for Windows install is great for working with git from the "Git Shell" PowerShell command line. Since it is git, you will always have a local copy of everything, and can use your own custom git repo for backup or similar if you ever want to move to something else.

GitHub is also becoming the go-to place for open source projects from Microsoft e.g. .NET Core etc. and many others.

I would not recommend sourceforge at all. It is terrible in my view.

I would recommend GitHub.

Just to give an example you can see how the .NET coreclr repo looks like here: https://github.com/dotnet/coreclr and how docs are an integrated part of it.


Author: Harry Date: 2016-06-02 13:56
You could use pandoc to convert Word docx (if that is your format) to markdown; see ronn-bundgaard.dk/blog/convert-docx-to-markdown-with-pandoc/


Author: Agner Date: 2016-06-09 11:53
Harry wrote:
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead?
The name CRISC is taken on GitHub, but there is no activity whatsoever on the name and no contact address. I don't know if it is possible to take over an inactive GitHub account.

Anyway, the world has enough short acronyms. Does anybody have a suggestion for another name for this project?

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Rethinking DLLs and shared objects

Post by agner » Thu Nov 02, 2017 9:59 am

Author: Agner Date: 2016-05-20 05:07
Windows systems use dynamic link libraries (DLLs) and Unix-like systems (Linux, BSD, Mac OS) use shared objects (SOs). Both types have a number of disadvantages. I have an idea for replacing DLLs and SOs with something more efficient in the new CRISC architecture.

Let me first explain how the current systems work. A normal program goes through the steps of compiling, linking, and loading. The linker joins multiple program modules and library functions together and adjusts all addresses in the linked program. Any absolute addresses in the code are adjusted according to the placement of each module, and all relative addresses from one module to another are calculated. This process is called relocation. The loader may do the relocation once again in case the program is placed at a memory address different from the address that was assumed in the link process. A DLL or SO is linked and loaded in basically the same way as an executable program.

The main difference between a DLL and a SO is that shared objects allow "symbol interposition". It is possible to override a symbol in a SO by defining another symbol with the same name in the calling program. This feature is intended to mimic the behavior of a static link library, but symbol interposition is hardly ever used and it comes at a high cost. Every access to a function or a global variable needs to go via a procedure linkage table (PLT) or a global offset table (GOT). This applies even to internal references inside the SO if the symbol is globally visible. It would be easy to bypass this time-consuming mechanism if the linker and loader allowed it, but for unknown reasons, they don't.

The advantage of using a DLL or SO is that the loaded code can be shared between multiple running programs. This rarely saves any memory, however, because the library may contain hundreds of functions while you are only using a few of them. It is not uncommon to load a DLL or SO of one megabyte and use only one kilobyte of it.

Another problem that makes DLLs and SOs less efficient than static libraries is that they are scattered around in each their memory block. Each DLL/SO will use at least two memory pages, one for code and one for data. The scattered memory access makes caching less efficient.

Now, my proposal for the CRISC architecture is to get completely rid of DLLs and SOs. Instead, we will have only one type of function libraries that can be used in three different ways:
  • Static linking. This will work the same way as today.
  • Load-time linking. The library is linked with the executable program by the loader when the program is loaded into memory.
  • Run-time linking. The library is loaded by commands in a running program.
In all three cases, we are loading only those functions from the library that are actually needed.

Load-time linking will be easier with the CRISC system than with existing systems because the CODE and DATA sections are independent of each other in the CRISC system. The CODE (and CONST) sections are addressed relative to the instruction pointer, while the DATA section is addressed relative to a special pointer called the data section pointer (DATAP). CODE and DATA can be placed anywhere in memory independently of each other. If extra library functions need to be linked in at load time, then the CODE and DATA sections of the library functions are simply appended to the CODE and DATA sections of the main program, and any new cross references are resolved. The result will be very efficient because the code and data of the library functions are contiguous with the code and data of the main program, so that caching is improved. There are no intermediate import tables or PLTs to slow down the execution.

Run-time linking is used less often. It is needed only when the choice of library depends on user input to the running program, or when a library is loaded explicitly from a not-compiled script language. The loader can use several different methods when run-time linking is requested:
  • The main program may have reserved extra memory space for the library functions. This information is stored in the header of the executable program file. The library function is accessed through a function pointer which is returned by the load_library function. Any DATA section in the library can be addressed through DATAP, using the normal relocation procedure.
  • If there is no reserved space, or the reserved space is too small, then the loader must place the library function somewhere else in memory. If there is a vacant memory space within a distance of +/- 2 GB from the address in DATAP then the same method as above is used.
  • If there is no vacant space within 2 GB of DATAP then the loader can insert a stub that changes DATAP to point to the DATA section of the library function. The function is called through this stub, which changes DATAP when called, and restores the old value of DATAP on return. If the function can throw an exception then the exception handler needs to restore DATAP as well.
  • The library function can be compiled with a compiler option that tells it not to use DATAP. The function will load the absolute address of its DATA section into a general purpose register and access its data with this register as pointer.
If lazy loading of a program module is desired then use the same method as for run-time linking, or put the lazy module into a separate executable file.

Newer versions of Linux have a feature called Gnu indirect function which makes it possible to choose between different versions of a function at load time depending on, for example, the microprocessor version. This feature will not be copied in the CRISC system because it relies on a PLT. Instead, we can make a dispatcher system to be used with load-time linking. The library can contain a dispatch function which tells which version of a library function to load. The loader will first load the dispatch function (possibly using run-time linking into itself) and call it. The dispatch function returns the name of the chosen version of the desired function. The loader then unloads the dispatch function and links the chosen function into the main program. The dispatch function must have access to information about the hardware configuration, command line parameters, environment variables, and anything else that it might need to choose which version of the function to use.

System functions and device drivers are called by using an ID number rather than a function pointer. This ID number can be resolved at link time, load time or run time just like library functions.

The advantages of my proposal are:
  • There is only one type of function libraries. The same library can be used with any of the three methods: static linking, load-time linking, and run-time linking.
  • Only the part of the library that is actually needed is loaded.
  • The code and data of the library is contiguous with the code and data of the calling program in most cases. This makes memory management simpler, avoids memory fragmentation, and improves caching.
  • There are no intermediate import tables, procedure linkage tables or global offset tables to reduce the performance.
Any comments?


Author: cv Date: 2016-05-20 12:33
Hi,

How would you deal with non-reentrant functions (and their global data) belonging to the same library in all your three types?
And subsequent non-exported & helper functions internal to the library that could potentially modify those global data?


Author: Agner Date: 2016-05-20 13:51
cv wrote:
How would you deal with non-reentrant functions (and their global data) belonging to the same library in all your three types?
And subsequent non-exported & helper functions internal to the library that could potentially modify those global data?
Library functions should be thread-safe by default because we are aiming at high performance which often means multiple CPU cores running multiple threads. Of course it is possible to make functions that are not thread-safe or reentrant, whatever your reason might be for doing so. A function with static data is likely to be thread-unsafe unless you take special precautions. My proposal allows functions to have a static data section that can be used for whatever you want. If you want multiple library functions to share the same static data then you can either put them into the same module or put labels on the data with public visibility.


Author: Peter Cordes Date: 2016-05-30 08:06

I like the idea of getting rid of the indirection on every call to a library function; that seems like a good plan. Current software is of course designed for the current situation of lazy dynamic linking. This non-lazy dynamic linking may not do very well with some existing software.

GUIs tend to use a lot of library code. I'm worried that copying so much code into private memory for every process will use significant amounts of memory. It won't be touched a lot of the time, but everything that's reachable from the QT or GTK API functions used by a program has to be read from disk and copied into the process's private memory, not just mmap(MAP_SHARED). This means it can only be paged out to swap space, since the library text isn't backed by a file on disk.

Web browsers build much of their code as libraries. Many browsers use multiple processes instead of just multiple threads, as an extra barrier against vulnerabilities. On Linux, I assume chrome just fork()s, so memory that isn't modified after that is shared. AFAIK, Windows doesn't have an equivalent system call, so each process might be stuck with its own private copy of all the library code.

In summary, I think the current state of desktop software is too bloated for this to be a good idea.

If we hope to ever see any of this proposal get used for real (at some point in the far future?), I think we might need some kind of lazy dynamic linking, or at least shared text segments for libraries. Vendors will always want to be able to sell cheap machines with the minimum amount of memory.

Non-lazy linking sounds great for long-running processes that don't overdo it with libraries (e.g. a web server process), but I'm also worried about higher startup overhead for commands used from shell-scripts. With lazy dynamic linking, code for functions that are never called (because that command-line option wasn't used) doesn't ever have to be read into memory. OTOH, most programs will use the same libc functions, and having them hot in the pagecache means it's cheap for the dynamic linker to memcpy them.

It would be interesting to run some tests comparing shell-script speed (cut/grep/sed/cat/find) with Linux lazy dynamic linking vs. this proposal. I think this could work on x86, or any other arch. Linux supports custom executable file formats, or for this experiment we could use ELF with a different dynamic linker as a custom ELF-interpreter. IDK if we'd want a different library object-file format, though, to make it easy and fast to recursively find all the functions a library-function could itself call.


Author: Agner Date: 2016-05-30 11:11
Peter Cordes wrote:
GUIs tend to use a lot of library code. I'm worried that copying so much code into private memory for every process will use significant amounts of memory.
If the GUI library is big, it may run in a separate process. The cost of this is that communication between library and main program will be slower because it needs to switch the memory map.
most programs will use the same libc functions
I often use static linking of libc. It works fine and it doesn't copy too much code into the executable file. If a GUI library has a proper modular design, it should be possible to link it statically without copying everything. It might be possible to test this with some open source GUI library to see how much code will be copied with static linking.
Linux supports custom executable file formats, or for this experiment we could use ELF with a different dynamic linker as a custom ELF-interpreter.
That's interesting. I guess this requires that you make your own loader. I would still use ELF format but add a custom section type.


Author: Joe Duarte Date: 2016-06-17 00:41
I don't think bloated GUI libraries will be an issue. Remember that Agner is proposing an entirely new ISA. Windows 8 and Mac OS 10.11 will never be ported to this ISA. Let's imagine that Agner's proposal garnered good reviews and wide industry attention. In an optimal scenario, we might see some version of CRISC implemented in silicon in 2023 or so. Realistically, any platforms deployed on it will be new, possibly built from scratch, like unikernels or Microsoft Research's library OSes. GUIs can be implemented very lightly now with immediate-mode libraries like imgui: https://github.com/ocornut/imgui

Statically-linked executables are gaining momentum with Go and, on the C side, musl. A musl executable is much lighter than a glibc one.

Agner, it seems like your proposal could either eliminate the need for tree-shaking, or require a new kind of tree-shaking, depending on implementation details. Would code be tree-shaken by default, in that nothing is included unless it's used, kind of an opt-in vs opt-out scenario?

You'll have to cover the security angles here or no one will accept it. There are lots of pointer memory management add-ons to C right now that make decisions about which kinds of pointers will be stored where and so forth. How will your ISA integrate with those methods? I'm thinking of things like Code Pointer Integrity and SafeStack: http://dslab.epfl.ch/proj/cpi/


Author: Agner Date: 2016-06-18 00:44
Joe Duarte wrote:
GUIs can be implemented very lightly now with immediate-mode libraries like imgui: https://github.com/ocornut/imgui

Statically-linked executables are gaining momentum with Go and, on the C side, musl. A musl executable is much lighter than a glibc one.

Agner, it seems like your proposal could either eliminate the need for tree-shaking, or require a new kind of tree-shaking, depending on implementation details. Would code be tree-shaken by default, in that nothing is included unless it's used, kind of an opt-in vs opt-out scenario?
All function libraries will work like static libraries (*.lib in Windows, *.a in Unix) in the sense that they include only what is needed.
You'll have to cover the security angles here or no one will accept it. There are lots of pointer memory management add-ons to C right now that make decisions about which kinds of pointers will be stored where and so forth. How will your ISA integrate with those methods? I'm thinking of things like Code Pointer Integrity and SafeStack: dslab.epfl.ch/proj/cpi/
Thanks for the interesting links. You are touching on an important topic. We have to design security into the system right from the beginning.
I have proposed a dual stack system where the call stack is separate from the data stack. A buffer overflow will be unable to compromise the return address of a function. Jump tables for switch/case statements and virtual function pointer tables for C++ polymorphism can be placed in the CONST section which is read-only. What remains is function pointers declared explicitly in the program. These may be placed in a protected static memory area or on the heap at a non-predictable address. An input buffer may be placed in its own sandbox where it is protected from overflow.


Author: Bigos Date: 2016-06-18 04:40
About security, some time ago I had something interesting (or not) in mind that could be useful for a new ISA.

What about making a special "function entry" instruction that has to be placed as the first instruction of every function? The call instructions would check whether they point to this special instruction and bail out if they do not. This way we are no longer able to call into the middle of a function. For tail calls, we could make an instruction that behaves like a jump but also checks for this restriction. I am not sure whether the extra instruction (or maybe a bit in the first instruction?) is worth it, but we could combine it with what is usually placed as the first instruction of a function (like stack manipulation in order to reserve local variables) in order to save space.

We could also place some kind of function metadata there (like used registers, which are for input, which are for output, etc.).
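
A minimal sketch of the entry-point check, written as plain C modelling an emulator's call path rather than any real encoding; OP_FUNC_ENTRY and the opcode layout are made-up illustrations:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define OP_FUNC_ENTRY 0x7F   /* hypothetical opcode of the "function entry" marker */

    /* A call or checked tail-jump is only allowed if the first instruction at
       the target is the entry marker. */
    static bool call_target_ok(const uint32_t *code, size_t target)
    {
        return (code[target] & 0xFF) == OP_FUNC_ENTRY;   /* assumed: opcode in low byte */
    }

    static bool do_call(const uint32_t *code, size_t target, size_t *pc)
    {
        if (!call_target_ok(code, target))
            return false;                 /* bail out: illegal call target, trap */
        *pc = target;                     /* transfer control to the function */
        return true;
    }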


Author: Freddie Witherden Date: 2016-06-02 17:37
With regards to static vs dynamic linking it is probably worth considering the following note by Ulrich Drepper:

https://www.akkadia.org/drepper/no_static_linking.html

who was the maintainer of glibc for several years and the author of the excellent series on "What Every Programmer Should Know About Memory". On balance -- and speaking as someone who comes from the HPC side of things -- I am inclined to agree with his points.

With regards to: "It is not uncommon to load a DLL or SO of one megabyte and use only one kilobyte of it." on most (all?) systems .so's are mmap'ed and so benefit from on-demand paging; hence you only pay for cost for what you use viz-a-viz the page size. Further, while an application may call only a small number of functions from an .so those functions inside of the .so may go on to call a large number of additional -- internal -- functions. These 'iceberg' .so's are very common for GUI libraries and the such like. Of course these internal functions need not be exported and hence should not suffer from a performance penalty beyond that associated with being PIC.


Author: Agner Date: 2016-06-04 03:12
Freddie Witherden wrote:
With regards to static vs dynamic linking it is probably worth considering the following note by Ulrich Drepper
Dynamic linking relies heavily on memory paging. My proposal for CRISC1 is to avoid memory paging completely. I am replacing dynamic linking with load-time linking which can do the same as dynamic linking with respect to updating a library without updating the calling program. There is even a feature for choosing between different libraries or functions at load time. Very large libraries, such as GUI frameworks, can be loaded as a separate process.

There are significant performance costs to DLLs and SOs which I want to avoid. Each DLL/SO has its own memory pages for code and data. You will be wasting a whole 4kB page even if the function uses only a few hundred bytes. The data section of a DLL/SO cannot be shared. The code section can be shared in most cases, but it is not shared if it contains relocated cross-references. The amount of memory that you are saving by sharing functions between multiple processes is very small compared to the amount of memory that you are wasting by loading a lot of other functions that your program is not using.

The dynamic libraries are scattered around in memory which means poor caching. Many functions are placed at the start of a memory page. This means that cache line sets with low numbers are used disproportionately more, or you need a more complicated cache design to distribute the load on cache line sets more evenly.

If a DLL contains many functions and some of these functions - not the ones you are using - link to other DLLs, then you will load these other DLLs unnecessarily, or suffer the unpredictable response time of lazy loading.

Unix-style shared objects have further performance losses due to the requirement of the so-called position-independent code, which actually involves something completely different, namely symbol interposition. 32-bit x86 shared objects have an additional performance problem because the instruction set has no position-independent addressing mode.


Author: Freddie Witherden Date: 2016-06-04 13:49
Out of interest have you checked out the work of Darek Mihocka who some years ago wrote a multi-part series on fixing x86 (and, interestingly, proposed eliminating hardware paging):

www.emulators.com/docs/nx03_10fixes.htm
www.emulators.com/docs/nx04_malloc.htm
www.emulators.com/docs/nx05_vx64.htm

With regards to the cost of PIC, I do not believe that it is as high as it used to be. Case in point: the majority of executables shipped by modern distributions are PIC as a means of improving the effectiveness of ASLR. Yet, when this change happened, I do not recall seeing complaints from users or any serious performance degradation.


Author: Agner Date: 2016-06-06 14:25
Freddie Witherden wrote:
have you checked out the work of Darek Mihocka who some years ago wrote a multi-part series on fixing x86 (and, interestingly, proposed eliminating hardware paging)
Thank you for the interesting links. His experiments show that the costs of the complicated memory management and the huge page tables are even higher than I thought. I agree with him that we need a fundamental redesign.

Many of the security problems that Darek discusses can be solved with CRISC1. The memory map in my proposal is so small that it is no problem to give each thread in a process its own memory map and its own private data space that cannot be accessed from other threads. Do you know if any other systems have memory protection between threads? I can't find any (other than emulators). Of course we have thread-local storage, but that does not guarantee that one thread cannot corrupt data belonging to another thread in the same process in case of program bugs or malicious hacking.

I don't see any reason why different threads should have access to each other's private data. Synchronization structures (e. g. semaphores) and resource sharing between threads should go through the shared main memory of the application, not the private memory of one thread.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Is it better to have two stacks?

Post by agner » Thu Nov 02, 2017 10:03 am

Author: Agner Date: 2016-06-05 13:26
In the beginning of this thread I argued against having a link register. Storing return addresses on the stack is simpler.

Now I wonder if it is better to have two stacks: a call stack for return addresses and a data stack for the local variables of functions. The call stack will be quite small in most cases because the nesting level of function calls is limited in most programs. We can have a small rolling stack on the chip which is used most of the time. If the on-chip stack overflows then it must be spilled to memory. Let's take an example: we have a rolling stack on the chip with 16 entries. We are running a program where function calls are nested 20 deep. The on-chip stack will be copied to memory at call level 17. The on-chip stack entries are overwritten one by one (oldest first) on each deeper call. After call number 20 there will be 4 entries that are overwritten. The first 16 returns after the deepest call can take the return addresses from the on-chip stack. We don't have to reload the on-chip stack from memory until we come down to level 4. We will never have to spill the on-chip stack to memory inside a loop more than on the first iteration unless there is very deep function nesting or there are recursive functions inside the loop. In other words, the costs of spilling the on-chip stack to memory are minimal because it will not occur repeatedly in a loop except in recursive functions.
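
As a rough illustration of the spill and refill behaviour described above, here is a plain-C model of a 16-entry rolling return stack; the capacity, the spill area, and the eager refill on every return are assumptions, not a hardware specification (real hardware might refill lazily, as in the example with levels 20 and 4):

    #include <stdint.h>
    #include <assert.h>

    #define ONCHIP 16                       /* assumed on-chip capacity */

    typedef struct {
        uint64_t ring[ONCHIP];              /* rolling on-chip entries */
        uint64_t mem[1024];                 /* spill area in ordinary memory */
        int      depth;                     /* current call nesting depth */
    } CallStack;

    static void push_return(CallStack *s, uint64_t addr)   /* on CALL */
    {
        if (s->depth >= ONCHIP)             /* oldest on-chip entry is about to be overwritten */
            s->mem[s->depth - ONCHIP] = s->ring[s->depth % ONCHIP];   /* spill it */
        s->ring[s->depth % ONCHIP] = addr;
        s->depth++;
    }

    static uint64_t pop_return(CallStack *s)               /* on RET */
    {
        assert(s->depth > 0);
        s->depth--;
        uint64_t addr = s->ring[s->depth % ONCHIP];
        if (s->depth >= ONCHIP)             /* restore the entry that was spilled earlier */
            s->ring[s->depth % ONCHIP] = s->mem[s->depth - ONCHIP];
        return addr;
    }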

In fact, modern microprocessors already have such a rolling call stack on the chip. It is used for predicting return addresses. We might as well use this structure as the genuine call stack rather than just a shadow stack used for prediction. The prediction of return addresses will then be simple and perfect, of course.

There is also a security advantage to having a separate call stack. The return address of a function cannot be overwritten by software bugs or malicious buffer overflow attacks. Legitimate attempts to change the return address of a function will also be prevented, of course, but this is bad programming anyway because it wrecks the prediction mechanism.

The mechanism of the on-chip stack can be hidden in special registers. The application does not need to access the stack pointer of the call stack, except for stack unwinding in the exception handler.

The cost of having two stacks is the complexity of saving the on-chip stack to memory when it overflows. This can be implemented in hardware or software. Memory management will also be a little more complex because there are two stacks that can overflow. The size of both stacks can be predicted by using the method explained in my document, except for recursive functions.

I will propose, tentatively, to allow both principles - one stack or two stacks - in the CRISC1 architecture. It does not matter to the software whether there is one or two stacks except when function parameters are saved on the stack. A function needs to know the addresses of its parameters relative to the stack pointer, and this depends on whether there is a return address in between. It is rare that we will have parameters on the stack because the first 16 general purpose registers and the first 16 vector registers can be used for function parameters.

If we want the same software to be compatible with both one-stack and two-stack systems then we need to solve the problem of the address of parameters on the stack, however rare it may be. The simplest solution is to put an empty space on the data stack where the return address would be if we have a separate call stack. But I want to suggest a smarter solution: don't put parameters on the stack. If there are more parameters than registers then put the extra parameters in a list and use one register to point to this list. This solution is simple and efficient. We are getting rid of the old discussion of which order of parameters on the stack is better, and whether the stack should be cleaned up by caller or callee.
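
To make the convention concrete, here is a hedged C sketch of how a compiler might lower a call with more parameters than registers; the cut-off of 16 register parameters and all names are illustrative only:

    #include <stdint.h>

    /* Suppose a function takes 20 integer parameters. The first 16 travel in
       registers (modelled here as ordinary C arguments, most of them elided);
       the remaining 4 are placed in a list, and one register holds a pointer
       to that list. */
    typedef struct {
        int64_t extra[4];                   /* parameters 17..20 */
    } ExtraArgs;

    int64_t callee(int64_t p1 /* ... p2..p16 elided ... */, const ExtraArgs *rest)
    {
        /* Parameters beyond the registers are addressed relative to 'rest',
           never relative to the stack pointer. */
        return p1 + rest->extra[0] + rest->extra[3];
    }

    int64_t caller(void)
    {
        ExtraArgs list = { { 17, 18, 19, 20 } };  /* the caller may place this anywhere */
        return callee(1, &list);
    }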

So this is my proposal. Small simple systems can have one unified stack, and large systems where performance or security is important can have two stacks. The size of the on-chip call stack is implementation dependent. The calling convention is changed so that parameters are never saved on the stack. Application programs don't need to care whether there is one or two stacks, but the stack unwinding mechanism in the exception handler needs to know, of course.


Author: Hubert Lamontagne Date: 2016-06-07 16:46
Having a second stack in a small separate memory just for IPs does sound workable. It does have the advantage that CALL and RET aren't memory operations anymore, which means they take a single micro-op, and you get real CALL/RET opcodes instead of having to decompose them into jump-and-link+store LR and load LR+jump LR. So given the appropriate hardware implementation, it could reasonably translate into a small speed gain for code that does a lot of function calling (you're using one less data cache port and micro-op on every CALL and every RET).

This means that it could reasonably be worth the extra hardware complexity - I think it would probably need a write buffer and register-renamed SP so that it can be rolled back in case of a branch prediction failure/page fault/unpredicted reordered store to already loaded memory address/etc, so you have to factor this into the cost.


Author: Eden Segal Date: 2016-06-13 22:40
What about debugging? Keeping the call stack on the CPU is good for performance, but if the table is not visible to the debugger, there are up to 16 function calls it doesn't know about.
While you can say it will be worked around in debug mode, debugging an optimized application is still necessary, and it is hard enough on the debugger as it is.


Author: Agner Date: 2016-06-13 23:36
Eden Segal wrote:
What about debugging?
No problem. There will be instructions for accessing the on-chip call stack. These instructions must be accessible to debugger and exception handler. They may or may not be accessible to the application in user mode.


Author: Hubert Lamontagne Date: 2016-06-14 21:28
The other reason I can think of to have user-accessible stack is supporting coroutines.


Author: Agner Date: 2016-06-14 23:59
Hubert Lamontagne wrote:

The other reason I can think of to have user-accessible stack is supporting coroutines.

If coroutines are implemented with a call stack, it would be filled up. Are you thinking about removing an entry from the call stack before calling another routine? The same functionality can be obtained with a tail call or a state machine.

The next version of my proposal will have a calling convention that makes tail calls possible in all cases. In the rare case that there are more than 16 integer parameters or 16 vector parameters to a function, the remaining parameters will not be stored on the stack but in a list pointed to by a register. The same method applies to a variable arguments list.


Author: Hubert Lamontagne Date: 2016-06-15 19:15
Agner wrote:

Hubert Lamontagne wrote:
The other reason I can think of to have user-accessible stack is supporting coroutines.
If coroutines are implemented with a call stack, it would be filled up. Are you thinking about removing an entry from the call stack before calling another routine? The same functionality can be obtained with a tail call or a state machine.

The next version of my proposal will have a calling convention that makes tail calls possible in all cases. In the rare case that there are more than 16 integer parameters or 16 vector parameters to a function, the remaining parameters will not be stored on the stack but in a list pointed to by a register. The same method applies to a variable arguments list.
Well, within a high-level language engine you can completely bypass the normal function calling, which lets you replace all execution flow with a state machine or do tail calls for functional programming, yes. I was thinking more of the context of something like C++ or ASM, where you're stuck with calling conventions and where coroutine handling libraries often allocate a separate stack for each coroutine using malloc(), then swap the stack pointer around.

As for the cases where you have more than 16 integer or vector/float parameters, how do you allocate that list for extra parameters? Isn't the simplest way to do that to allocate it on the stack? Or would you call malloc()?


Author: Agner Date: 2016-06-15 23:43
If you want multiple stacks for coroutines, why not use multiple threads? Anyway, a coroutine switch would work much the same as a thread switch does in the operating system.

Hubert Lamontagne wrote:
As for the cases where you have more than 16 integer or vector/float parameters, how do you allocate that list for extra parameters? Isn't the simplest way to do that to allocate it on the stack? Or would you call malloc()?
The caller can put the list anywhere it wants. Possibly on the stack. My point is that the called function will not address its parameters relative to the stack pointer. We get rid of the old discussion about which order of parameters on the stack is more logical and who should clean up the stack.


Author: Hubert Lamontagne Date: 2016-06-16 14:22
Agner wrote:
If you want multiple stacks for coroutines, why not use multiple threads? Anyway, a coroutine switch would work much the same as a thread switch does in the operating system.
Yeah, the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps - I'm thinking of the case of trying to write a video game where every object gets a time slice on every frame: you end up with hundreds of threads, and if the OS decides to put even just one of them in the freezer for a few dozen ms, you get frame stutter... Though TBH games aren't normally written this way anyways (plus it makes saving/restoring the game state hard).


Author: Agner Date: 2016-06-16 23:44
Hubert Lamontagne wrote:
the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps
You are right. We need to think about real time applications. Is there anything special we need to do to support real time operating systems? It would be nice to be able to reserve one or more CPU cores in a multicore processor for a particular critical task, but that is an operating system issue. The hardware just needs to support fast task switching. We can have extremely fast task switching if the CPU has multiple on-chip memory maps that it can switch between.


Author: Hubert Lamontagne Date: 2016-06-17 10:07
Agner wrote:
Hubert Lamontagne wrote:
the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps
You are right. We need to think about real time applications. Is there anything special we need to do to support real time operating systems? It would be nice to be able to reserve one or more CPU cores in a multicore processor for a particular critical task, but that is an operating system issue. The hardware just needs to support fast task switching. We can have extremely fast task switching if the CPU has multiple on-chip memory maps that it can switch between.
It's a bit of an application-specific thing.

For instance, I'm 99% sure MacOS has a special case in its scheduler for the audio thread, so that when the sound hardware sends a "the sound buffer is 50% empty, fill the other half" interrupt, the sound code runs straight off the interrupt and literally outprioritizes EVERYTHING ELSE (including hardware drivers!), completely disables priority inversion, completely bypasses any sort of fair time-slice allocator, and so forth. If the audio thread falls on a locked mutex, it might even elevate the thread holding that mutex to this crazy "screw everything else" priority - this is necessary to use mutexes, or things locked by mutexes like malloc(), in the audio thread; otherwise trying to lock the mutex might result in a task switch to a low-priority GUI thread which then gets preempted, and then you get an audio glitch.

In Windows, the scheduler is generally designed for networking, so you generally get screwed over: at the very least, you need to install a special app to get low-latency audio (ASIO4ALL), you need to spend half an hour fine-tuning buffer lengths, any badly written driver can make it completely impossible (for instance, there was an Intel network driver that would periodically lock everything for 5 ms), and even when it does work, all other applications get no sound because you have to bypass the sound mixer, which runs at way too low a priority (and on way too large blocks). In Linux, afaik this sort of scheduling is possible (although it's not really meant for it), but it requires special flags in the thread scheduler priority and in the mutexes (and possibly a kernel patch); there's a video from Android OS engineers about getting a low-latency audio path in Android using that (tl;dr version: it took some work and many versions!).

Another case I can think of is game consoles... if you have 8 cores, does it make any sense to run ANYTHING ELSE than the currently played game on 7 of those cores?


Author: Freddie Witherden Date: 2016-06-22 11:40
With regards to exposing a second stack, it is worth considering a recent proposal from Intel termed Control-flow Enforcement Technology: https://software.intel.com/sites/defaul ... review.pdf

The principle is simple: require that the target of an indirect jump or call is an ENDBRANCH instruction (a NOP on current systems). There are also a series of extensions to enable the shadow stack to be manipulated by debuggers and the like. It seems like a reasonable model of how someone would go about exposing and managing a second return call stack.

With regards to supporting two operating modes (with and without a second stack), my thoughts here are that it is simply not worth it. If your platform does not support some sort of hardware functionality to prevent these ROP-type attacks, then compilers will kludge around it on your behalf. A case in point is the stack-smashing protection technology present in many compilers. Obviously, system developers feel that the impact on performance is more than offset by the resulting improvements in security.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Now on Github

Post by agner » Thu Nov 02, 2017 10:06 am

I have now moved this project to Github as some of you recommended. This will facilitate the collective development of software toolchain and hardware implementation.

The name CRISC was taken, so I have changed the name to ForwardCom. It stands for forward compatible computer system.

I have converted the manual to LaTex so that Github can show diffs. The pdf manual is built from many tex files. The pdf manual version 1.02 is here https://github.com/ForwardCom/manual/ra ... ardcom.pdf.

New in version 1.02:
  • Security features added.
  • Support for dual stack.
  • Some instructions and formats modified, including more formats for jump and call instructions.
  • System call, system return and trap instructions added.
  • New addressing mode for arrays with bounds checking.
  • Memory management and ABI standards described in more detail.
  • Instruction list in comma separated file instruction_list.csv to be used by assemblers, emulators, debuggers, etc.
  • Object file format defined in file elf_forwardcom.h

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for instruction set - now on Github

Post by agner » Thu Nov 02, 2017 10:51 am

Author: Agner Date: 2016-06-26 02:48
The discussion in the thread Proposal for an ideal extensible instruction set has been very fruitful and led to a lot of new ideas. I don't know where this will lead us, but the project looks so promising that it is worth pursuing further.

I have put the project on Github to facilitate collective development of the instruction set, software toolchain, system standards, and hardware implementations. The address is https://github.com/ForwardCom. The development of hardware should be done in collaboration with the Opencores forum.

The name CRISC was taken, so I have changed the name to ForwardCom. It stands for "forward compatible computer system".

The latest version of the manual is at https://github.com/ForwardCom/manual or http://www.agner.org/optimize/#instructionset. I have also reserved the domain name http://www.forwardcom.info.

The ForwardCom project includes both a new instruction set architecture and the corresponding ecosystem of software standards, application binary interface (ABI), memory management, development tools, library formats and system functions. Here are some highlights:
  • The ForwardCom instruction set is a compromise between the RISC and CISC principles, combining the fast and streamlined decoding and pipeline design of RISC systems with the compactness and more work done per instruction of CISC systems.
  • The ForwardCom design is scalable to support small embedded systems as well as large supercomputers and vector processors without losing binary compatibility.
  • Vector registers of variable length are provided for efficient handling of large data sets.
  • Array loops are implemented in a new flexible way that automatically uses the maximum vector length supported by the microprocessor in all but the last iteration of a loop. The last iteration automatically uses a vector length that fits the remaining number of elements. No extra code is needed to deal with remaining data and special cases. There is no need to compile the code separately for different microprocessors with different vector lengths (see the sketch after this list).
  • No recompilation or update of software is needed when a new microprocessor with longer vector registers becomes available. The software is guaranteed to be forward compatible and take advantage of the longer vectors of new microprocessor models.
  • Strong security features are a fundamental part of the hardware and software design.
  • Memory management is simpler and more efficient than in traditional systems. Various techniques are used for avoiding memory fragmentation. There is no memory paging and no translation lookaside buffer (TLB). Instead, there is a memory map with a limited number of sections with variable size.
  • There are no dynamic link libraries (DLLs) or shared objects. Instead, there is only one type of function libraries that can be used for both static and dynamic linking. Only the part of the library that is actually used is loaded and linked. The library code is kept contiguous with the main program code in almost all cases. It is possible to automatically choose between different versions of a function or library at load time, based on the hardware configuration, operating system, or user interface framework.
  • A mechanism for calculating the required stack size is provided. This can prevent stack overflow in most cases without making the stack bigger than necessary.
  • A mechanism for optimal register allocation across program modules and function libraries is provided. This makes it possible to keep most variables in registers without spilling to memory. Vector registers can be saved in an efficient way that stores only the part of the register that is actually used.
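
As a rough model of the array-loop principle from the list above, with plain C standing in for vector instructions and MAXVLEN standing in for whatever maximum vector length a given processor supports:

    #include <stddef.h>

    #define MAXVLEN 8   /* assumed hardware maximum, in elements; could be anything */

    /* a[i] += b[i] for n elements. Each outer iteration models one pass of the
       vector loop: the hardware processes min(remaining, MAXVLEN) elements, so
       the same binary runs unchanged on processors with longer vectors. */
    void add_arrays(float *a, const float *b, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t vlen = (n - i < MAXVLEN) ? (n - i) : MAXVLEN;  /* last pass shrinks */
            for (size_t j = 0; j < vlen; j++)     /* stands in for one vector instruction */
                a[i + j] += b[i + j];
            i += vlen;
        }
    }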

Author: Joe Duarte Date: 2016-07-04 01:49
Hi Agner – Can you recap how you resolved the objection that Hubert had to unified int/floating point registers? He had argued for the benefits of dedicated FP registers, but I can't find the resolution (the thread became quite long).

Other questions:

2. How do you situate your ISA in the context of the current trend toward heterogeneous computing, GPGPU, AMD's HSA, OpenCL, OpenMP, OpenACC, FPGAs, etc.? Will the new ISA help, hinder, or be neutral to heterogeneous programming?

3. This is an interesting time to develop a new ISA because of the oft-claimed end of Moore's Law scaling. How do the anticipated physics of the next 10 – 20 years of microprocessor design impact ForwardCom? Is the physical substrate irrelevant? (I'm thinking of III-V semiconductors, mostly) Or is ForwardCom shaped by present-day silicon and fab assumptions? It seems like some aspects of an ISA must be affected by the underlying physics and engineering constraints. For example, whether unified registers can be efficiently implemented, or whether orthogonal instructions can be efficiently implemented, and definitely how variable-length vector registers will perform.

4. Is ForwardCom extensible to a unified RAM (volatile) and non-volatile memory space? There are a few proposals out now, including Moneta D, that rethink the file system and make calling solid state storage a lot like calling a memory address (or just the same thing). They mostly hinge on faster-than-SSD media, like Intel and Micron's new 3D XPoint, phase-change memory, etc. but you could probably do it with fast SSDs.

I think it would be interesting if applications could set a few custom constants at start time. A while back I asked about having useful constants hardwired in the CPU (e.g. π), and you guys explained why it probably wouldn't be worth it. I think letting each application tell the CPU to save a few constants that the application will often use might be more useful. The point is to be able to invoke the constant without an immediate, but just with a much smaller shortcode (two or three bits). It's kind of like an application-specific cache (an idea which Intel has implemented a bit), but much, much smaller. The programmer might not even need to know or designate the constants – it could be empirically determined by the compiler. This would only help certain kinds of applications, but that's true of many things – it would be a way to reduce code size but more importantly to have a sticky, stable cache immune from misses and eviction. An expanded version of this would be to be able to define and save specific, short code sequences. (I see your Chapter 6, but it seems like a placeholder for now.)


Author: Agner Date: 2016-07-04 05:47
Joe Duarte wrote:
Hi Agner – Can you recap how you resolved the objection that Hubert had to unified int/floating point registers?
Hubert's arguments are here http://www.agner.org/optimize/blog/read.php?i=421#472.

My reason for using the same registers for floating point scalars and floating point vectors is that floating point code often contains calls to mathematical function libraries. Such a library can use variable-length vectors for parameters and results. The same library function can be called with a scalar or a vector because a scalar is simply handled as a vector with one element. This will make it easy for the compiler to vectorize the code. ForwardCom has a separate register set for integer scalars. These are useful for pointers, loop counters, array indexes, etc. that don't need vectors, and they are easier to save and restore than vector registers.
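
A small sketch of the library-interface point, with a plain C pointer-plus-length pair standing in for a variable-length vector register; vec_exp is a made-up name:

    #include <math.h>
    #include <stddef.h>

    /* One library entry point serves both scalars and vectors, because a
       scalar is simply handled as a vector with one element. */
    void vec_exp(double *data, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            data[i] = exp(data[i]);
    }

    void example(double *array, size_t n)
    {
        double x = 2.0;
        vec_exp(&x, 1);       /* scalar call */
        vec_exp(array, n);    /* vector call, same function */
    }
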
How do you situate your ISA in the context of the current trend toward heterogeneous computing, GPGPU, AMD's HSA, OpenCL, OpenMP, OpenACC, FPGAs, etc.? Will the new ISA help, hinder, or be neutral to heterogeneous programming?
People are sometimes doing non-graphic calculations in a GPU. This is complicated and awkward and has many problems with data transfer and synchronization. ForwardCom will eliminate this need. You may have a system with two sets of ForwardCom processors, one for graphics and one for everything else, but in most cases you will just have a system with several identical all-purpose ForwardCom processors.
This is an interesting time to develop a new ISA because of the oft-claimed end of Moore's Law scaling. How do the anticipated physics of the next 10 – 20 years of microprocessor design impact ForwardCom? Is the physical substrate irrelevant?
The physics affects the clock frequency and the maximum vector length that can be implemented efficiently. Some kind of layered or 3-D chip could possibly help make very long vectors more attractive, because it reduces physical distances on the chip.
Is ForwardCom extensible to a unified RAM (volatile) and non-volatile memory space? There are a few proposals out now, including Moneta D, that rethink the file system and make calling solid state storage a lot like calling a memory address (or just the same thing). They mostly hinge on faster-than-SSD media.
Yes. The memory management system of ForwardCom divides a program's memory into two blocks: (1) code and constants, (2) read/write data including static data, stack and heap. Block (1) is addressed relative to the instruction pointer, while block (2) is addressed relative to a special register called data section pointer (DATAP) and the stack pointer. Nothing in one block is addressed relative to something in the other block. Each of the two blocks can be placed at any address independently of the other block. We can easily place block (1) in non-volatile memory provided it has fast read access. It doesn't need fast write access. Block (2) should be placed in volatile RAM. Microcontrollers with Harvard architecture are like this anyway.
I think it would be interesting if applications could set a few custom constants at start time. A while back I asked about having useful constants hardwired in the CPU (e.g. π), and you guys explained why it probably wouldn't be worth it. I think letting each application tell the CPU to save a few constants that the application will often use might be more useful. The point is to be able to invoke the constant without an immediate, but just with a much smaller shortcode (two or three bits). It's kind of like an application-specific cache
ForwardCom can have constants embedded in instructions. This includes integer constants of all sizes and floating point constants of half, single and optionally double precision (requires 1, 2 and 3 word instruction size, respectively). Your idea would require an extra register set for saving and restoring data that will be needed later. This extra register set would just require two extra instructions for copying data from a normal register to an extra register and vice versa. This could be useful for many purposes and it would reduce the pressure on the data cache. The only problem I can see is that it will be an extra complication to the function calling convention because we need to know whether a function modifies the extra registers or not.
An expanded version of this would be to be able to define and save specific, short code sequences. (I see your Chapter 6, but it seems like a placeholder for now.)
That would be more complicated.


Author: Hubert Lamontagne Date: 2016-07-06 09:51
A couple questions:

- What's the exact algo for determining if an OP contains a memory load?

- What's the exact algo for determining if an OP contains a memory store?

Generally you need to know this as soon as possible in the pipeline, because a lot of out-of-order CPUs do the memory ops in-order well before the other stuff. And even when memory ops are reordered, they need a separate reordering to preserve the proper store sequence. (note: this must include ALL cases)

Bonus question:

- What's the exact algo for determining if an OP is integer or FPU/vector?

Same as above, but this is to figure out which ops go into the integer reg-rename + queue + ALUs + retirement, and which ones go to the FPU/vector side.


Author: Agner Date: 2016-07-06 11:05
Hubert Lamontagne wrote:
What's the exact algo for determining if an OP contains a memory load?
Thanks for the questions. This is important for efficient decoding.

The format is determined by the instruction length (IL) and mode bits; see table 3.14 in the manual. The following formats have a memory operand: 0.4 - 0.9, 2.0, 2.2, 2.4.x, 2.8.x, plus two tiny instructions (see tables 4.6 and 4.7).

The only mandatory instructions that have memory store are the store instruction (see table 4.5) with the same formats as above, and two tiny instructions (table 4.6 and 4.7). Other instructions with memory store are optional, listed in table 4.10.
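
For illustration, the memory-operand rule above can be written as a simple table lookup; the packing of the format number into one integer is an assumption made for this sketch, not the actual ForwardCom decoder:

    #include <stdbool.h>

    /* format_id is assumed to be the format number from table 3.14 packed as
       major*16 + minor, e.g. format 2.4 -> 0x24. The sub-formats of 2.4.x and
       2.8.x share the same major.minor here. */
    static bool format_has_memory_operand(unsigned format_id)
    {
        switch (format_id) {
        case 0x04: case 0x05: case 0x06:
        case 0x07: case 0x08: case 0x09:          /* formats 0.4 - 0.9 */
        case 0x20: case 0x22: case 0x24: case 0x28: /* 2.0, 2.2, 2.4.x, 2.8.x */
            return true;
        default:
            return false;   /* the two tiny instructions are handled separately */
        }
    }
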
What's the exact algo for determining if an OP is integer or FPU/vector?
General purpose registers:
  • The mode field is 0 or 1 when ignoring M (format 0.0, 0.1, 0.8, 0.9, 1.0, 1.1, 1.8, 2.0, 2.1, 2.8.x, 2.9, 3.0, 3.1)
  • Format 2.6
  • Tiny instruction with OP1 < 16
  • Jump instructions with no M field (format 1.5, 2.7.2, 2.7.3, 2.7.4) or M = 0.
  • Register operands for pointer, index, and vector length are always general purpose registers
Vector registers:
  • The mode field is 2-7, except format 2.6
  • Tiny instruction with OP1 ≥ 16
  • Jump instructions with an M field (format 1.4, 2.7.0, 2.7.1, 3.0.1) and M = 1
Jump instructions also have a separate category:
  • Format 1.4 and 1.5
  • Format 2.7 with OP1 = 0-4
  • Format 3.0 with OP1 = 1
As you can see, I have tried to make the rules reasonably consistent, but some exceptions to the rules have been necessary for the sake of making instructions as compact as possible.


Author: Hubert Lamontagne Date: 2016-07-07 01:45
Some questions about your memory model:

- For every Load/Store, how is the CPU going to differentiate which memory map entry it falls into? If you have 16 memory map entries, does it compare every load/store address to all 16 block start addresses? (which requires 16 comparators in hardware if you want to calculate it in a single cycle?! or multiple cycles of latency and 4~5 comparators?!?)

- If I understand your memory allocation scheme correctly, every program's heap size is determined beforehand, and when it grows beyond that size, you absolutely have to use up one more memory map entry? Wouldn't that use up memory map entries really fast and go through hundreds of memory map entries for complex applications like a web browser? And then force the copy of hundreds of megabytes of data when something like a video game runs out of memory map entries?

- Wouldn't you at least require memory blocks to be aligned to L1 cache way size? (usually 4k~32k on modern CPUs?)... Or at very least, wouldn't it have to be aligned to L1 cache line size? (32~64 bytes typically?)... The problem with non-cache-aligned memory blocks (and more specifically addends) is that it forces you to add up this memory offset after calculating memory address but before reading it from the cache, since you no longer have a 1:1 relationship between cache lines and virtual addresses. Forcing alignment lets you do this offsetting *in parallel* with the cache access, which saves at least 1 full cycle of cache latency on, like, everything. The other option is to use a virtually-addressed cache, but this has the downside that it introduces aliasing problems (and you have to flush the cache when the mapping changes).


Author: Agner Date: 2016-07-07 02:56
Hubert Lamontagne wrote:
Some questions about your memory model:

- For every Load/Store, how is the CPU going to differentiate which memory map entry it falls into? If you have 16 memory map entries, does it compare every load/store address to all 16 block start addresses? (which requires 16 comparators in hardware if you want to calculate it in a single cycle?! or multiple cycles of latency and 4~5 comparators?!?)
Both methods are allowed, or some hybrid. Comparators are not that expensive so we can probably afford to have 16 comparators. It takes several clock cycles to fetch a memory operand from cache or write a memory operand to cache, so it is also possible to use a binary tree algorithm with 2-3 comparisons per clock cycle while fetching the operand speculatively at the same time. Most memory accesses will use the same one or two map entries as previous accesses so we can save power by starting each search where the last search ended.
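
As a sketch of the lookup itself (independent of whether the hardware uses parallel comparators or a tree), a binary search over a small, sorted memory map could look like this in C; MapEntry and its fields are invented for the illustration:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t start;       /* block start address (entries sorted by start) */
        uint64_t end;         /* one past the end of the block */
        unsigned permissions; /* read/write/execute bits, etc. */
    } MapEntry;

    /* Returns the matching entry or NULL. With 16 entries this is at most 4
       comparison levels; hardware could instead compare all entries in parallel. */
    static const MapEntry *lookup(const MapEntry *map, size_t n, uint64_t addr)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (addr < map[mid].start)      hi = mid;
            else if (addr >= map[mid].end)  lo = mid + 1;
            else                            return &map[mid];
        }
        return NULL;   /* no entry covers the address: access violation */
    }
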
If I understand your memory allocation scheme correctly, every program's heap size is determined beforehand, and when it grows beyond that size, you absolutely have to use up one more memory map entry? Wouldn't that use up memory map entries really fast and go through hundreds of memory map entries for complex applications like a web browser?
This is a weak spot indeed. I am imagining a scheme where the heap is extended with memory blocks that grow exponentially in size. If a 1 MB heap is exhausted you allocate 2 or 4 MB more. If they get exhausted you allocate 16 MB more, etc. In addition to this, we should do everything to predict the heap size. As a last resort, move data if the memory becomes too fragmented (this will cost a delay, of course, but less than the delay of swapping memory to disk).

With these methods, we should still be able to keep the memory map so small that it is more efficient than a TLB with fixed page sizes. The maximum size of the memory map is implementation dependent.
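
A hedged sketch of the exponential growth policy described above, as a runtime's allocator might implement it; the initial size and the growth factor are arbitrary choices:

    #include <stddef.h>

    /* Each time the heap is exhausted, request a new block that grows the
       total heap size geometrically, so a program that eventually reaches H
       bytes of heap only ever needs a logarithmic number of memory map
       entries for its heap. */
    static size_t next_heap_block(size_t current_total)
    {
        size_t initial = (size_t)1 << 20;    /* assumed 1 MB first allocation */
        if (current_total == 0)
            return initial;
        return current_total * 3;            /* new block = 3x current, total grows 4x */
    }
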
Wouldn't you at least require memory blocks to be aligned to L1 cache way size? (usually 4k~32k on modern CPUs?)... Or at very least, wouldn't it have to be aligned to L1 cache line size? (32~64 bytes typically?)
In most cases, yes. It depends on how the cache is designed. I have just specified 8 bytes as a minimum that all implementations must obey, including small systems with no cache. Smaller memory blocks may be used for special purposes such as shared memory blocks and protected sandboxes for secure input buffers. Maybe I should indicate more clearly that the 8 bytes is a minimum, not a requirement.

Locked