The original proposal, 2 years ago

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am

The original proposal, 2 years ago

Post by agner » Thu Nov 02, 2017 6:56 am

Two years ago I made the first proposal for a new instruction set. I am copying the discussion thread from the old messageboard (http://www.agner.org/optimize/blog/read.php?i=421) to the new one here. As you can see, a lot has happened since then.
Original date: 2015-12-27

Table of Contents
  • Introduction
  • Basic architecture
  • Proposed code structure
  • FPGA
  • Extensibility
  • Portability
  • ABI and calling conventions
  • Assembly language syntax
  • Summary of advantages
  • Can existing instruction sets be fixed?
Introduction
An instruction set is a standardized set of machine instructions that a computer can run.

There are many instruction sets in use. Introducing a new instruction set is not an easy thing to do because it breaks compatibility with existing software and hardware. The successful introduction of a new instruction set is therefore a rare occurrence in the evolution of computer technology, while extensions to existing instruction sets occur frequently. Some commonly used instruction sets are poorly designed from the beginning and amended with many extensions and patches. One of the worst cases is the widely used x86 instruction set family, which is the result of a long series of short-sighted extensions and patches. The result of this development is a very complicated code system which is very difficult and costly to decode in a microprocessor. We need to learn from past failures in order to be prepared to make better choices from the start, in case the opportunity to design a new instruction set should come up.

The purpose of this article is to construct an example of a new instruction set that is better designed from the start, based on the experience we have with existing instruction sets. The following principles are important to have in mind:
  • The instruction set should have a simple and consistent design.
  • The instruction set should represent a suitable compromise between the RISC principle that enables fast decoding, and the CISC principle that makes more efficient use of code cache resources.
  • The design should be expandable so that new instructions and extensions can be added in a consistent and predictable way.
  • The instruction set should be designed through an open process with the participation of the international hardware and software community.
  • The instruction set should be non-proprietary and allow anybody to make compatible software, hardware and equipment for test, debugging and emulation.
  • Decisions about design and extensions should not be determined by the short term marketing considerations of an oligopolistic microprocessor industry but by the long term needs of the entire hardware and software community and NGOs.
  • The design should allow application-specific extensions.
The problems with the x86 instruction set are discussed in my blog article Stop the instruction set war http://www.agner.org/optimize/blog/read.php?i=25. See also Krste Asanović and David Patterson: "The Case for Open Instruction Sets. Open ISA Would Enable Free Competition in Processor Design". Microprocessor Report, August 18, 2014 (http://www.linleygroup.com/mpr/article.php?id=11267).

Basic architecture

Instruction format

A pure RISC instruction set has the advantage that all instructions have the same length. This makes it easy to decode multiple instructions in parallel. But it has the disadvantage of using a lot of precious space in the code cache. A CISC instruction format can have a variable instruction length. The well-known x86 format allows instructions of any length from 1 to 15 bytes. This makes the code more compact, but it is very complicated and expensive to decode. It is difficult for the microprocessor to decode multiple instructions in parallel because it needs to find the length of the first instruction before it knows where the second instruction begins, and the instruction length is determined by a complicated algorithm involving many elements of the instruction. Instruction decoding is therefore often a serious bottleneck.

The proposed instruction format is a compromise between these two principles. Instructions can have a few standardized lengths and formats, and the determination of the length is simple. This allows for smaller instructions to save size, and longer instructions when there is a need for more bits for address, data, registers or extra options. Many instructions exist in multiple versions with different sizes. The instruction format is completely orthogonal in the sense that the same instruction can be specified with register, memory or immediate operands, different integer sizes, different floating point precisions, different vector lengths, and different addressing modes.

An instruction can use one, two or three dwords of 32 bits each - that is 32, 64 or 96 bits. No other sizes are permitted. Instructions must be aligned to dword addresses. The first two bits (most significant bits) of the first dword of an instruction indicate the length:

00 = 1 dword
01 = 1 dword
10 = 2 dwords
11 = 3 dwords
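
The length determination above can be sketched as a tiny decoder. This is only an illustrative model of the proposal, not an implementation; the function name is mine:

```python
def instruction_length(first_dword: int) -> int:
    """Return the instruction length in dwords, taken from the two
    most significant bits of the first dword (per the proposal)."""
    top2 = (first_dword >> 30) & 0b11
    # 00 and 01 -> 1 dword, 10 -> 2 dwords, 11 -> 3 dwords
    return {0b00: 1, 0b01: 1, 0b10: 2, 0b11: 3}[top2]

print(instruction_length(0x00000000))  # 1
print(instruction_length(0x80000000))  # 2
print(instruction_length(0xC0000000))  # 3
```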

In order to further save space in the code cache, there can be certain instructions which do multiple operations in a single short instruction, such as:
  • Set multiple registers to the same constant value (typically zero). The first register and the number of registers is specified.
  • Read multiple registers from consecutive memory addresses. Optionally increment pointer by the size. Can use the stack pointer or any other pointer register.
  • Save multiple registers to consecutive memory addresses. Optionally increment or decrement pointer by the size. Can use the stack pointer or any other pointer register.
  • Combined arithmetic operation and conditional jump.
A dword of all zeroes is a nop (no operation). The processor is allowed to skip nops as fast as it can. These nops can be used as fillers, but not as timing delays.

Registers

There are 32 universal registers, named r0 - r31. The proposed instruction set has only one type of register. These registers can be used for all types of data: Boolean, 8-, 16-, 32-, 64- and (optionally) 128-bit signed and unsigned integers, floating point numbers with single, double and (optionally) quadruple precision, pointers, references, flags and predication masks. This reduces the number of different instructions because the same instruction can be used on different types of data, and because no instructions are needed for transferring data from one type of register to another. For example, the same 'AND' instruction can be used for operations on Booleans, for manipulating bits in integers, for manipulating the sign bit of floating point numbers, and for manipulating predication masks.

The same registers can also be used as vectors of any of these data types. The microprocessor must support vectors of at least 128 bits. Support for larger sizes is optional. Vector sizes up to 8192 bits can be specified by 3 bits in the instruction code. It is also possible to specify the largest available vector size in an instruction. This can be anything from 128 bits and up, with no upper limit. Software can take advantage of future extensions by specifying the largest available vector size. The largest size can be modified by a control register to any power of 2 from 128 to the largest size supported by the microprocessor.

The unused part of a register is always set to zero whenever a register is modified. No instruction leaves part of a register unchanged except for instructions intended for blending or interleaving data. This is important in order to avoid false dependencies on the previous value of the full register, which is known to cause serious performance problems in some existing processors (known as partial register stall). The processor does not actually need to spend power on setting all the superfluous bits to zero. Typically, it will simply turn off the unused parts of execution units and data buses in order to save power.

Stack

There is one stack. The stack register is r31. Including the stack register as one of the universal registers makes it possible to use it as a base pointer in memory addressing and to modify the stack frame with arithmetic instructions. The stack register need only be 64 bits.

Instruction pointer

The instruction pointer is 64 bits. It is not included in the universal registers. The reason for this decision is to avoid the possible modification of the instruction pointer by arithmetic instructions, which would make branch prediction difficult.

Flags

There is no dedicated flags register. Registers r1 - r7 can be used as predicate registers or mask registers. Many instructions can be predicated. A predicated non-vector instruction will use one of these registers as predicate, and execute the instruction only when bit 0 in the predicate register is 1. The predicate register is thus also a Boolean variable. Execution is unconditional when r0 is specified as the predicate register.

The predication mechanism can be vectorized. A predicate vector is also known as a mask. A masked vector instruction works in the following way. Each element in the vector is processed only if the corresponding element in the mask register has 1 in its least significant bit. The mask register is treated as a vector of Booleans, where each element in the Boolean vector has the same number of bits as the data vectors in the instruction, and only the least significant bit in each Boolean vector element is used, while the remaining bits are ignored. (Other systems use the most significant bit, or all bits, in the mask, but it is preferred to use the least significant bit for the sake of compatibility between Boolean scalars and Boolean vectors). Results that are masked off are either unchanged or set to zero, depending on the instruction. Some instructions support both options to be selected with a feature bit.
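
The masking semantics described above can be sketched as follows. This is a behavioral model, not a prescribed implementation; the function name and the choice of addition as the example operation are mine:

```python
def masked_add(a, b, mask, zero_masked=False):
    """Element-wise add under a mask. Only the least significant bit of
    each mask element counts, as in the proposal. Masked-off results
    are either left unchanged (taken from a) or zeroed, depending on
    the feature bit modeled by zero_masked."""
    out = []
    for x, y, m in zip(a, b, mask):
        if m & 1:                      # least significant bit selects
            out.append(x + y)
        else:
            out.append(0 if zero_masked else x)
    return out

# mask element 1 has value 2: its least significant bit is 0, so it masks off
print(masked_add([1, 2, 3], [10, 20, 30], [1, 2, 1]))  # [11, 2, 33]
```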

Instructions for extended precision arithmetic, such as add-with-carry and subtract-with-borrow, work in the following way. One register is specified in the predicate register field of the instruction code. Bit 0 of this register is used as both carry-in and carry-out. The traditional arithmetic flags are written to a few bits of the predicate register:

bit 0: carry flag
bit 1: zero flag
bit 2: overflow of signed arithmetic
bit 3: sign bit
bit 4: negative = sign xor overflow

If the predicate register for an add-with-carry instruction is specified as r0 then the carry-in will be 0, but the arithmetic flags for the result will be written to r0. Shift and rotate instructions can output a carry to a predicate register, but may not have a carry-in. There are no rotate-through-carry instructions, but an add-with-carry of a register to itself can be used as a rotate-left-through-carry (Rotate through carry instructions are rarely used anyway, and they are very inefficient on many processors). Integer and floating point compare instructions also produce these flags.
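
The flag layout listed above can be modeled like this. The bit assignments are exactly those in the list; the word size parameter is my addition:

```python
def add_with_carry(a, b, carry_in, bits=64):
    """Model of add-with-carry, packing the resulting flags into
    bits 0-4 of the predicate register as listed in the proposal."""
    mask = (1 << bits) - 1
    raw = (a & mask) + (b & mask) + (carry_in & 1)
    result = raw & mask
    sign = (result >> (bits - 1)) & 1
    sa, sb = (a >> (bits - 1)) & 1, (b >> (bits - 1)) & 1
    carry = (raw >> bits) & 1              # bit 0: carry flag
    zero = int(result == 0)                # bit 1: zero flag
    overflow = int(sa == sb and sign != sa)  # bit 2: signed overflow
    negative = sign ^ overflow             # bit 4: negative = sign xor overflow
    flags = carry | zero << 1 | overflow << 2 | sign << 3 | negative << 4
    return result, flags

print(add_with_carry(2**64 - 1, 1, 0))  # (0, 3): carry and zero set
```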

The carry mechanism can be vectorized so that multiple add-with-carry operations can be executed in parallel.

Branches

Branching is done with combined arithmetic-and-branch instructions. These are ALU instructions such as add, subtract, compare, bit test, etc. combined with a conditional jump, for example: subtract and jump if not zero, compare and jump if above, test a specific bit and jump if it is zero. These instructions cannot be vectorized. The vector size field is used as condition code. There is no need to support predicated jump instructions because these can be replaced by a combined bit test and conditional jump instruction. Multiway branches can be implemented with indirect jump or indirect call.

Debug and interrupt flags

There are various control registers which can be used for debugging purposes, interrupt control, etc.

Addressing modes

The address space uses 64-bit addresses only. Addresses are always relative to the instruction pointer, stack pointer or a register pointer. Absolute addressing need not be supported. The following addressing modes are supported:
  • Instruction pointer + 32 bit sign-extended offset
  • Instruction pointer + index register + 32 bit sign-extended offset
  • Base register + 8 or 32 bit sign-extended offset
  • Base register + index register + 32 bit sign-extended offset
  • Base register + scaled index register + 32 bit sign-extended offset
The size of data operands or vector elements is always specified in the instruction. This size is used as a scale factor which is applied to all 8-bit offsets. For example, if the operand size is 32 bits = 4 bytes, then any 8-bit offset is multiplied by 4. 32-bit offsets are never scaled. The index register can also be scaled by the operand size.

Direct conditional and unconditional jumps and calls are always relative to the instruction pointer with 8-bit or 32-bit sign-extended offset, scaled by 4 because all instructions are aligned to addresses divisible by 4.
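
The scaling rule can be illustrated with a small model; the function and parameter names here are mine:

```python
def effective_address(base, offset_field, offset_bits, operand_size_bytes):
    """Compute base + sign-extended offset. 8-bit offsets are scaled by
    the operand size; 32-bit offsets are used as-is (per the proposal)."""
    # sign-extend the offset field
    if offset_field >= (1 << (offset_bits - 1)):
        offset_field -= (1 << offset_bits)
    scale = operand_size_bytes if offset_bits == 8 else 1
    return base + offset_field * scale

print(hex(effective_address(0x1000, 3, 8, 4)))     # 0x100c
print(hex(effective_address(0x1000, 0xFF, 8, 4)))  # 0xffc (offset -1, scaled by 4)
```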

CPUID

A CPUID instruction must have functions for telling whether optional features are supported, e.g. 128-bit integers, quadruple precision floating point, and the maximum vector size for each type of operands. There should also be features for telling how efficient certain instructions are, to help software determine the optimal coding version.

Proposed code structure

An instruction code contains a combination of the fields described below, where some of the fields can be omitted. The total size of all the fields must be 32, 64 or 96 bits.

Instruction length: 2 bits.

00 = 1 dword = 32 bits
01 = 1 dword = 32 bits
10 = 2 dwords = 64 bits
11 = 3 dwords = 96 bits

Instruction format: 2 or more bits.

Each combination of the instruction length and instruction format bits defines a class of instructions having a particular combination of the remaining fields. In other words, the combination of instruction length and instruction format bits determines which of the following fields are present, and their sizes.

Instruction code: 6 or more bits.

These bits are used for distinguishing the individual instructions, such as add, move, jump, etc. The number of instruction code bits is simply the number of bits not used for anything else. Therefore, the number of instruction code bits can be different for different instruction formats. The instruction bits are not necessarily contiguous if the placement of other fields on fixed positions has higher priority in the design.

Register: 1 - 4 fields of 5 bits each.

Used for operand registers, base pointer and index register.

Predicate register: 3 bits.

Specifies a register used for predicated scalar instructions, masked vector instructions, and flags input and output. Only r1 - r7 can be used as predicate register. r0 means no predicate.

Operand size and type: 3 bits.

Defines the type, size and precision of operands, integer or floating point. The size for integer operands can be 8, 16, 32, 64, and optionally 128 bits. The precision of floating point operands can be single, double, and optionally quadruple precision. Half precision is not supported, except in conversion instructions.

Vector length: 3 bits.

Specifies the length of vectors in bits. Possible values are: scalar, 128, 256, 512, 1024, 2048, 4096, and max. Support for values above 128 is optional. When "scalar" is specified, the size of the operand is determined by the operand size/type field alone. A vector will contain as many elements of the specified operand size as can be contained in the vector size. For example, a 256 bit vector can contain 8 elements of 32 bits each. The "max" specification gives the largest vector size supported by the processor. This value depends on the processor and must be a power of 2. The minimum allowed value is 128, with no upper limit. The max value may be different for different operand sizes. A piece of software can take advantage of future extensions by specifying the max vector size. The max value can be reduced by settings in a control register.
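
The set of values above is from the proposal, but the assignment of bit patterns to values is not fixed by it; the encoding in this sketch is hypothetical:

```python
# Hypothetical encoding of the 3-bit vector length field; the proposal
# lists the eight values but does not fix the bit patterns.
VECTOR_LENGTH = {0b000: "scalar", 0b001: 128, 0b010: 256, 0b011: 512,
                 0b100: 1024, 0b101: 2048, 0b110: 4096, 0b111: "max"}

def elements_in_vector(vector_bits, operand_bits):
    """A vector holds as many elements of the operand size as fit."""
    return vector_bits // operand_bits

print(elements_in_vector(256, 32))  # 8, as in the example above
```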

Addressing modes: 2 bits.

The following addressing modes are defined for memory operands. An instruction can have no more than one explicit memory operand with this specification.

00: IP + index + 32 bits offset (specify r31 for no index)
01: base + 8 or 32 bits offset
10: base + index + 8 or 32 bits offset
11: base + index * operand size + 8 or 32 bits offset

The base and index registers are specified in register fields. The offset size (8 or 32 bits) depends on the instruction format. 8-bit offsets are always multiplied by the specified operand size in bytes. For example, an operand size of 32 bits = 4 bytes will multiply the value in the offset field by 4. 32-bit offsets are not multiplied by this factor. Offsets are always sign-extended. It is not required to support 16-bit or 64-bit offsets or absolute addressing.

Jumps and calls have an offset of 8 or 32 bits relative to the instruction pointer. This offset is multiplied by 4 because all instructions have sizes that are multiples of 4 bytes.

Address offset: 8 or 32 bits.

This field is used as specified above under addressing mode.

Immediate data operand: 8, 32 or 64 bits.

An 8-bit immediate value is interpreted as an integer, sign extended to the specified operand size. The signed value is converted to floating point if a floating point operation is specified.

A 32-bit or 64-bit immediate is interpreted as an integer for integer operations or a single or double precision float for floating point operations. The integer immediate constant is sign-extended if necessary. The floating point immediate constant is converted to the desired precision if necessary.

16-bit immediates are not necessarily supported.
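
Sign extension of a small immediate to the specified operand size works as sketched here (a generic two's complement model, not anything specific to this proposal):

```python
def sign_extend(value, from_bits, to_bits):
    """Interpret `value` as a signed from_bits integer and widen it to
    to_bits, keeping the two's complement bit pattern."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):          # sign bit set: negative
        value -= (1 << from_bits)
    return value & ((1 << to_bits) - 1)

print(hex(sign_extend(0xFF, 8, 32)))  # 0xffffffff (-1 widened to 32 bits)
print(hex(sign_extend(0x7F, 8, 32)))  # 0x7f
```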

Rounding mode: 2 bits.

Optionally specifies the rounding mode used in floating point operations and conversions. Possible values are: round to nearest or even, round down, round up, truncate towards zero. The default value if there is no rounding mode field is "round to nearest or even". This option field is useful in float-to-integer conversion instructions, but rarely needed in other contexts. May be included in long versions of floating point instructions.

Exception control: 1 bit.

Enables or suppresses interrupts in case of numerical errors. This can be used for controlling exceptions in case of overflow and other errors in floating point operations. Can also be used for checking for overflow in integer arithmetic. An unsigned integer compare instruction with exception enabled can be used for checking if an array index is out of bounds. This feature may be included in long versions of arithmetic instructions.

Broadcast: 1 bit.

If 1, specifies that the last source operand is a scalar to be broadcast into all the vector elements. (Unnecessary when this is an immediate operand).

Zero masked data: 1 bit.

Specifies whether masked-out elements are set to zero or left unchanged. This bit may replace the broadcast bit on instructions with only register operands, or it may be a separate bit.

Other fields.

Other fields may be added if specific features are needed. Alternatively, an immediate data field may be used for specifying additional feature options.

Formats.

Commonly used instructions may be implemented in several different formats and instruction lengths, preferably with the same value in the instruction code bits. For example, an addition instruction, A = B + C, might be implemented in the following formats:
  • 3 registers (1 dword).
  • 2 registers and a predicate (1 dword).
  • 1 register and a predicate and an 8-bit immediate (1 dword).
  • 1 register and a memory operand with base and 8 bit offset, scalar only (1 dword).
  • 2 registers and a predicate and a 32-bit immediate (2 dwords).
  • 1 register and a memory operand with base and 32 bit offset (2 dwords).
  • 2 registers and a predicate and a memory operand with base and index and 32 bit offset (3 dwords).
  • 2 registers and a predicate and a 64-bit immediate (3 dwords).
The destination and the first source operand share the same register in some of these cases. The operand size and vector length bits can be used for specifying integer or floating point operands of different sizes and precisions in scalars or vectors of different sizes. In other words, the different variants of this instruction can be used for adding a register variable, a memory variable, or a constant to integers of any size as well as floating point variables of any precision in scalars and vectors of any size.

Combined ALU and conditional jump instructions may be implemented in the following formats:
  • 2 registers of 32-bit integers only, a condition code and an 8 bit displacement (1 dword).
  • 2 registers, a condition code and a 32 bit displacement (2 dwords).
  • 1 register, an 8 bit immediate, a condition code and a 32 bit displacement (2 dwords).
These instructions have no vector length field but a condition code instead. Floating point operands are not necessarily supported.

3-input instructions, such as fused multiply-and-add may be implemented in the following formats:
  • 3 registers and a predicate and option bits (2 dwords).
  • 2 registers and a predicate and a memory operand with base and index and 32-bit offset and option bits (3 dwords).
The destination register is the same as one of the source operand registers. The options include 4 bits for specifying sign change for even and odd vector elements of the addend and for even and odd vector elements of the product, respectively. This will cover all possible combinations of multiplication with addition, subtraction and alternating add/subtract in a single instruction.
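
The four sign-change option bits can be modeled as below. The proposal only says there are four such bits; which bit controls which combination of even/odd and product/addend is my assumption:

```python
def fused_multiply_add(a, b, c, opts=0):
    """d[i] = ±a[i]*b[i] ± c[i], signs chosen per element parity.
    Hypothetical bit layout: bits 0/1 negate the product on even/odd
    elements, bits 2/3 negate the addend on even/odd elements."""
    out = []
    for i, (x, y, z) in enumerate(zip(a, b, c)):
        prod, add = x * y, z
        if opts & (1 << (i & 1)):        # product sign: bit 0 (even) / 1 (odd)
            prod = -prod
        if opts & (4 << (i & 1)):        # addend sign: bit 2 (even) / 3 (odd)
            add = -add
        out.append(prod + add)
    return out

print(fused_multiply_add([1, 2], [3, 4], [5, 6]))          # [8, 14]
print(fused_multiply_add([1, 2], [3, 4], [5, 6], 0b0100))  # [-2, 14], alternating
```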

Less commonly used instructions may be implemented in just one or a few different formats.

FPGA

The microprocessor can have an optional FPGA or similar programmable hardware. This structure is used for making application-specific instructions or functions, e.g. for coding, encryption, data compression, signal processing, text processing, etc. This reduces the need for hard-coded application-specific instructions.

If the processor has multiple CPU cores then each core may have its own FPGA. The hardware definition code is stored in its own cache for each core. The operating system should, as far as possible, avoid using the same core for different tasks that require different hardware codes. If this cannot be avoided then the code, as well as the contents of any memory cells in the FPGA, must be saved on each task switch. This saving may be implemented lazily, i.e. the swap of contents is only made when the second task needs the FPGA structure that contains code for the first task.

Extensibility

The evolution of the x86 instruction set is full of short-sighted decisions with no room for future extensions. All kind of weird patches have been used to extend an instruction set that was never designed to be extensible. We must learn from these mistakes and consider the predictable need for future extensions when designing an instruction set.

There is reason to suspect that many of the instructions in the current x86 instruction set have been added with short-sighted marketing reasons in mind. Every new generation of microprocessors must have some new features that the competitors don't have, or new features that can be hyped to make customers buy the new version, according to the marketing logic. Some of these instructions are now obsolete, but still supported by the hardware.

The design of a stable instruction set should not be subject to competition and marketing whims, but designed by an open process with participation of the international hardware and software community, similar to the standardization work in other technological areas. A collective decision process reduces the risk of mistakes and short-sighted decisions.

The proposed instruction set is orthogonal, which reduces the number of different instructions. The inclusion of an FPGA reduces the need for application-specific instructions.

In addition to these considerations, it is necessary to add space for future extensions of the instruction set. Some of the instruction formats should have a large number of unused instruction code bits that can be used for future instructions or option bits. A few instruction format codes should be reserved for future extensions. All codes that begin with 111, i.e. 11 in the instruction length bits and 1 in the first bit of the instruction format field, should be reserved for future extensions. These bits could be used in the future either for more 3-dword formats with many instruction bits, or for 4-dword formats. This decision will be left to the future.

An attempt to execute an instruction with an unknown instruction code should cause an interrupt (exception) in most cases. This makes it possible to emulate new instructions on old microprocessors. In some cases, however, it is desired to make extensions that do not cause interrupts on microprocessors that don't support them. Historically, this has been done with extensions that affect performance, but not functionality, such as memory prefetching and branch prediction hints. Extensions of this kind can be made by using various option bits in contexts where they previously made no sense, for example rounding mode bits in an integer instruction. Thus, the processor should ignore certain unused option bits in certain instructions to make such performance extensions possible. Also, a small range of instruction codes should be reserved for future performance-tuning instructions, which will be ignored on processors that don't support them. To be more specific, we will have three categories of unused codes:
  • Code reserved for future use. Generates interrupt so that it can be emulated.
  • Code reserved for future use. Generates no interrupt, but behaves as a nop (no operation). Can be used for future purposes that allow the code to be ignored on processors that do not support it.
  • Code guaranteed to never be used. Will generate interrupt also on all future processors. Can be used for application-specific emulation or forced error messages.
Extending the vector register size

Extension of the vector register size is straightforward without the need to define any new instructions. This makes it possible for software to utilize a new extended vector size even without recompilation. The software can simply specify the maximum vector size and get information from a CPUID instruction about what this maximum vector size is.

We have seen in the history of x86 processors that the first processor generation to support a new and larger vector size has typically done so with poor efficiency. In most cases, it has used a half-size execution unit twice to do a full-size vector operation. This was not necessarily a bad design choice because the software that supports a new instruction set extension typically lags several years behind the hardware.

The situation is different with the extension mechanism proposed here. The software will be able to utilize a vector size extension immediately. The microprocessor should therefore not allow a vector size larger than what it can execute more efficiently than software using the next smaller vector size twice. It may still be worthwhile to allow a vector size that is larger than the execution unit and use this unit multiple times to process a full-size vector. This will save bandwidth in the decoder and use fewer registers than the alternative of repeating the instruction in software. The CPUID instruction should provide complete information about this: it should specify both the maximum vector size that can execute at full throughput and the maximum vector size that can execute at all. These values may be different for different types of operands.

Portability

The ABI, object file format, etc. should be standardized as far as possible in order to allow the same code to be compatible with different operating systems and platforms. This would make it possible, for example, to use the same function libraries in different operating systems. This can easily be achieved for libraries that are doing some mathematical operation and not using any system functions. A more ambitious goal is to establish portability even when some common system functions are involved, such as time functions or handling of multithreaded code.

Most importantly, there should be a portable way of generating error messages from a function library. This could be obtained with an error message instruction. This instruction should generate an interrupt (throw an exception). A few register operands can contain codes indicating the type of error, and a memory operand can point to an error message text. All platforms should be able to handle this error condition in a way that is appropriate for the type of user interface. In console mode applications, for example, the error message should go to the stderr output. In graphical user interface (GUI) applications, the error message should be shown in a pop-up window, or whatever method is appropriate for the specific GUI framework.

Error messages should be in the English language by default, with an optional feature for multi-language support. We can expect the need for multi-language support to be decreasing. The problems with multi-language applications are discussed in this document https://en.wikibooks.org/wiki/Usabilit ... nalization.

ABI and calling conventions

This is an example of how an efficient ABI (Application Binary Interface) can be designed.

Function calls will use registers for parameters as far as possible. The first 24 parameters to a function are transferred in registers r0 - r23. Any additional parameters are transferred on the stack in C language order. These parameters are removed from the stack by the caller. The function return value is in r0. Push and pop instructions are rarely used. The return instruction has no offset. The stack is kept aligned by 128 before any call instruction.

Variable argument lists are transferred on the stack with 64 bits for each argument.

Tuples: A structure or class or encapsulated array for which all non-static elements have the same type is transferred in a single vector register if the total size does not exceed 128 bits.

Parameters that do not fit into a single register are transferred by a reference to a memory object allocated by the caller. This applies to: structures and classes with elements of different types, or bigger than 128 bits, or having a non-standard copy constructor or destructor or virtual member function. It is the responsibility of the caller to call any copy constructor or destructor. Any parameters beyond the first 24 parameters are transferred in the same way as if they were in a register, using 64 or 128 bits of stack space, as appropriate.

A return value that does not fit into a register is transferred to a memory location allocated by the caller. A reference to this memory location is treated as the first parameter (before any 'this' pointer).

There are no registers with callee-save status in the case of static linking. This is because the called function does not know the maximum vector register size required by the caller. Instead, there is a mechanism that allows the caller to know which registers are modified by the called function. This information is stored in the object file for static link libraries. The object file format must support this information, which must be stored in the same file as the library function in order to make sure it has the right version. Compilers supporting "whole program optimization" can read this information from a library file before allocating registers.

This mechanism cannot be applied to dynamic linking. Instead, dynamic link libraries are prohibited from modifying certain registers.

The object file format should be a modified ELF format. Dynamic linking should use the Windows DLL method rather than the UNIX shared object method. The code uses position-independent addressing as far as possible. Any remaining relocation is performed at load time. Symbol interposition is not used. This eliminates the need for the inefficient global offset table (GOT) and procedure linkage table (PLT).

Information used for exception handling and stack unwinding should use a standardized and portable table-based method. Debugging information should also be standardized.

Assembly language syntax

The syntax for x86 assembly code has never been officially standardized, but each assembler uses its own dialect. The definition of a new instruction set should include the definition of a standardized assembly language syntax, preferably with the destination operand first.

Summary of advantages

The instruction set proposed here is a compromise between the RISC and CISC principles. A RISC instruction set with a fixed instruction size makes it easy to decode multiple instructions in parallel, but it wastes precious code cache space. If the fixed instruction size is big enough to accommodate all possible instruction types, then it must necessarily be too big for the most common simple instructions and therefore take up too much space in the code cache. The code cache is a precious resource because it is impossible to make the cache bigger without also making it slower. A CISC instruction set with many different instruction lengths makes it difficult to decode multiple instructions in parallel, and this can be a serious bottleneck. The proposed instruction set has just a few standardized instruction lengths: one, two and three dwords of 32 bits each. The length of the instruction is determined by the first few bits. This makes it possible to determine the lengths of multiple instructions in a single clock cycle by a process that resembles the look-ahead carry mechanism in binary adders.
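
The length-decoding idea can be sketched as follows. The bit assignment here is hypothetical — the proposal only states that the first few bits determine the length — but it shows why the scheme is fast: each length depends on a couple of bits of its own first dword, so the start positions of several instructions can be resolved together, much like carry-lookahead resolves all carries at once:

```c
#include <stdint.h>
#include <stddef.h>

/* Assume (hypothetically) that the top two bits of the first dword
   give the instruction length in 32-bit dwords. */
static int instr_len_dwords(uint32_t first_dword) {
    switch (first_dword >> 30) {
    case 0: case 1: return 1;   /* single-dword formats */
    case 2:         return 2;   /* double-dword format  */
    default:        return 3;   /* triple-dword format  */
    }
}

/* Find the start offsets (in dwords) of up to max_n instructions. */
static size_t find_starts(const uint32_t *code, size_t n_dwords,
                          size_t *starts, size_t max_n) {
    size_t pos = 0, count = 0;
    while (pos < n_dwords && count < max_n) {
        starts[count++] = pos;
        pos += instr_len_dwords(code[pos]);
    }
    return count;
}
```

A hardware decoder would evaluate all the `instr_len_dwords` lookups in parallel rather than sequentially as this C sketch does.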

The instruction set is completely orthogonal. An ordinary arithmetic or logic instruction, such as e.g. addition, can have many different versions for different types of operands. It can handle integers of 8, 16, 32, 64, and possibly 128 bits, as well as floating point numbers of single, double, and possibly quadruple precision. The last source operand can be a register, a memory operand, or an immediate constant. The same instruction can handle scalars or vectors of any length. This makes programming simpler and reduces the number of different instructions.

There is only one type of register. The same registers can be used for many different purposes, including integers and floating point numbers of all different sizes and precisions, as well as for Booleans, flags, pointers and references. The registers can also be used for vectors and masks.

Many instructions can be predicated, so that the instruction is either executed or not, depending on a Boolean variable stored in a predicate register. The predicate mechanism can be vectorized, so that the operation is turned on or off for each element in a vector, depending on a mask register containing a vector of Booleans.

Instructions can have short versions that save space by using only two operands, by omitting certain option bits, by using an 8-bit scaled offset in a memory operand, or by using a signed 8-bit constant as the immediate operand. For example, a double precision floating point addition can have immediate operands of three different sizes: a signed 8-bit integer which will be converted to floating point, a single precision float, or a double precision float. This constant is available at an early stage in the CPU pipeline so that there is enough time for converting it to the necessary size and precision without delaying the execution. The need to fetch numeric constants from data memory is eliminated in most cases because numeric constants can be contained in the instructions. This will increase the speed and reduce the load on the data cache. In most cases, the code size will not be increased by the inclusion of numeric constants in the instructions because they replace the addresses (typically 32 bits) of constants stored in data memory, and because the constants will fit into smaller formats whenever possible. Immediate constants can even be used with vector instructions where the same constant will be applied to all elements in the vector.
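
The compiler's choice between the three immediate sizes can be sketched like this. The function name is illustrative; the rule it implements — use the smallest format that represents the constant exactly — is the one described above:

```c
#include <stdint.h>

/* Smallest immediate format that holds a double constant exactly:
   8 bits (int8 converted to floating point at decode time),
   32 bits (single precision), or 64 bits (full double). */
static int imm_bits_needed(double v) {
    if (v >= -128.0 && v <= 127.0 && (double)(int8_t)v == v)
        return 8;
    if ((double)(float)v == v)
        return 32;
    return 64;
}
```

For example, the common constants 1.0 or 2.0 need only 8 bits, 0.5 needs a 32-bit single-precision immediate, and 0.1 (not exactly representable in single precision) needs the full 64-bit form.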

The size of the registers is not fixed in the design. It is possible to make bigger and more powerful microprocessors simply by making the registers bigger so that they can hold larger vectors. This mechanism is orthogonal as well. There are three bits in the instruction code which determine the vector length (or a scalar). This makes it possible to write software for future microprocessors with bigger vector registers that do not exist yet. Setting the vector length bits to 111 will give the largest vector size that the microprocessor supports, whatever this is. This makes it simple to support all vector sizes in the same piece of software. This feature also makes it possible to save an entire register even though the maximum register size is not known when the software is compiled. This can be useful in task switches, exception handlers, device drivers and system libraries. There is no limit to how big the maximum vector size can be. A CPUID instruction will tell the software what the maximum vector size is, and there will be a feature that enables a software program to reduce the maximum vector size if it is excessive.

The conventions for function calling, as well as other ABI details, should be specified together with the instruction set. This will improve compatibility and make it possible to use the same function libraries with different compilers, different programming languages, and different operating systems. There are 32 registers. This makes it possible to use registers for function parameters in almost all cases.

Can existing instruction sets be fixed?

The commonly used instruction sets can be divided into two general types, RISC and CISC. The RISC instruction sets generally have a more or less fixed instruction size. All instructions have the same number of bits. The advantage of a RISC design is that the fetching and decoding of instructions is simple and fast. The disadvantage is that commonly used simple instructions take more space than necessary while complicated instructions do not fit into the limited instruction size. Instructions that need many bits for addresses or constants do not fit into the RISC design.

A CISC instruction set has a variable instruction length. The advantage of this is that simple, commonly used instructions can be as small as a single byte, while more complex instructions or instructions with large addresses or constants can have a length that fits the purpose. This provides optimal use of the code cache. The disadvantage is that it is complicated to decode the instructions. Modern microprocessors can execute three or four instructions in parallel in a single clock cycle if no data dependence prevents this. But it is difficult to decode multiple instructions simultaneously when you have to determine the length of the first instruction before you know where the second instruction begins. Therefore, the bottleneck in a CISC processor is quite often decoding rather than execution.

The present article has proposed a compromise between RISC and CISC. The widely used x86 instruction set is a CISC design. Most other instruction sets in common use today are RISC designs.

x86 instruction set

The x86 instruction set has a long heritage dating back to the 8086 processor in 1978 where code density was of prime importance. It has been developed through many generations of additions and extensions without ever losing backwards compatibility. It is a CISC instruction set where instructions can have any length from 1 to 15 bytes, and it is quite complicated to determine the length of each instruction. It has many different types of registers. The latest extension, AVX-512, has 16 general purpose registers of 64 bits each, 6 segment registers of which only 3 are used in 64-bit mode, 8 floating point registers of 80 bits each, 8 MMX registers of 64 bits each which are overlaid on the floating point registers, 32 vector registers of 512 bits each, 8 mask registers of 64 bits each, a flags register and an instruction pointer. This patchwork could certainly use a redesign. Can it be combined with the design principles that are proposed here?

An easy solution would be to allow the two kinds of code to be used interchangeably and mixed. The new instructions would have to use some bit patterns that are not already in use in the old system. The x86 instruction set has 20 byte-codes that are currently used only in 16-bit and 32-bit mode, mostly for obsolete instructions. These codes can be used for other purposes in 64-bit mode. Therefore, it is possible to make new extensions that can be used only in 64-bit mode. We would prefer 64-bit mode anyway, so it would be possible to make extensions that implement some of the principles described here and still preserve backwards compatibility, but this would still be only patches on a fundamentally flawed, inefficient and outdated design. The 20 unused code bytes are scattered around the code map with only a few adjacent code bytes, so it would be impossible to use more than a few of these code bytes without making the whole system completely messy. Most of the bits in the first byte of any new code would therefore be fixed and unusable in such a hypothetical new code design.

A better solution would be to implement a separate mode for the new instruction design and a system for switching between the new mode and the legacy modes. The improvement in performance that can be obtained with a new instruction design is probably not high enough to justify the complications of a dual code system. Instead, we should be prepared to seize the opportunity in case the need for a major revision should arise for other reasons. It is not possible to make a decoder that translates the old codes to the new ones at runtime, because the new system does not support the many different types of registers that the old system has. A translation from the new system to the old one is also not possible. Instead, we would need two separate decoders that translate the old and the new codes, respectively, to the internal micro-operation format. This micro-operation format probably needs to be expanded to make space for 64-bit immediate constants, but the extra bits can be disabled when they are not needed, in order to save power.

The existing execution units could relatively easily be modified to support the new code design. The 32 universal registers of the new design should obviously be identical to the 32 vector registers of the old design. Combined ALU-and-conditional-jump instructions are already implemented internally in both Intel and AMD processors even though they are not available as x86 instructions.

It is a problem that many current processors have their execution units divided into two main clusters: One cluster is connected to the general purpose registers and handles integer scalar operations, pointer addressing and jumps. The other cluster is connected to the floating point and vector registers and handles all floating point and vector operations. All transfers of data between these two clusters typically involve a delay of one clock cycle. This two-cluster design would be a problem for the new instruction set where all units need access to the same register file.

Itanium instruction set

The Itanium instruction set is a very ambitious RISC instruction set. Itanium instructions are joined into bundles with a fixed size of 128 bits, containing three instruction codes of 41 bits each and a 5-bit template. The three instructions in a bundle will execute in parallel. This explicit parallelism puts a lot of work on the compiler to schedule instructions that can execute in parallel without violating the program logic. The Itanium has a rotating register stack where each program function allocates the number of registers it needs. It has many other advanced features, such as explicitly speculative instructions. The Itanium design has not been very successful, due mainly to the difficulties of making a suitable compiler. Another obstacle to the commercial success of the Itanium was a poor support for backwards compatibility with existing software. The Itanium system is so different from other systems that there would be little advantage in combining it with a new instruction set.

Other RISC instruction sets

Most other commonly used instruction sets today are RISC designs. These designs are generally simple and efficient. The instruction length is typically 32 bits. Some systems, such as ARM-Thumb-2 and AVR32 can use a mixture of short 16-bit instructions and longer 32-bit instructions. Most systems have several different register types. Some RISC processors support vector instructions with 128-bit vectors. There is a limit to the number of different instructions that can be coded in an instruction with a fixed size. It is a general problem with RISC instruction sets that they cannot support complex instructions with many option bits. This makes it difficult to add new options and features that the x86 instruction set has, such as masked vector operations, options for controlling rounding mode, etc. The limited instruction size of the RISC systems is also a problem where large addresses or large numeric constants are needed. It is not possible to define a large numerical constant or a jump to a distant address with a single instruction in a RISC design with a limited instruction size. Some of the RISC processors already have support for more than one instruction set and features for switching between these modes. An additional mode for a new instruction set could be added to these processors without serious problems.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Itanium

Post by agner » Thu Nov 02, 2017 8:12 am

Author: Ethan, Date: 2015-12-28 01:04
Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility does not matter?

Author: Agner, Date: 2015-12-28
Ethan wrote:
Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility does not matter?
The advantage of the Itanium instruction set was of course that decoding was easy. The biggest problem with the Itanium instruction set was indeed that it was almost impossible to write a good compiler for it. It is quite inflexible because the compiler always has to schedule instructions 3 at a time, whether this fits the actual amount of parallelism in the code or not. Branching is messy when all instructions are organized into triplets. The instruction size is fixed at 41 bits and 5 bits are wasted on a template. If you need more bits and make an 82 bit instruction then it has to be paired with a 41 bit instruction.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for an ideal extensible instruction set

Post by agner » Thu Nov 02, 2017 8:23 am

Author: hagbardCeline Date: 2015-12-28 04:08
You should take a look at RISC-V [1], which satisfies all the requirements you set, has broad involvement of academia and strong interest by the industry.

[1] riscv.org

Author: Agner Date: 2015-12-28 07:28
hagbardCeline wrote:
You should take a look at RISC-V [1], which satisfies all the requirements you set, has broad involvement of academia and strong interest by the industry.
[1] riscv.org
Thank you for the reference to RISC-V. I remember reading about it years ago, but couldn't remember the name. I tried in vain to find it with google.

RISC-V does indeed cover many of the same principles that I talk about. However, it seems to be more inspired by small systems of the past than by the bigger and more powerful high-end processors available today. A new ISA has to be future-oriented and performance-oriented. Some of the things that I miss in RISC-V are:
  • It is not completely orthogonal
  • Arithmetic instructions cannot have memory source operands
  • Immediate constants have odd sizes. It is not possible to include floating point immediates, which I argue would be more efficient than loading floating point constants from data memory
  • There are no predicated or masked instructions
  • 128-bit integers are not supported, except as pointers in 128-bit address mode
  • Support for vectors is not well developed. Vector size is limited to 1024 bits
  • There is no way to save and restore a vector register that is guaranteed to be compatible with future extensions
  • Software has to be recompiled each time a different processor with different maximum vector size becomes available
  • There is no support for integer vectors, Boolean vectors, masked vector operations, broadcast, etc.
  • long int can be 32 or 64 bits. There is no standardized way of specifying 64-bit and 128-bit integers. This inconsistency is causing annoying compatibility problems today which need to be fixed in any new ABI.
But again, I like the idea behind RISC-V

Author: Adrian Bocaniciu Date: 2016-01-04 05:57
While I fully agree with the stated purpose of RISC-V, I strongly disagree with some of their choices for instruction encoding, especially with their addressing modes.

The features proposed by Agner are much closer to what I would consider a good ISA, and I have a lot of experience in programming in assembly language for a huge number of different ISA's from antiquities like IBM System/360, PDP 11, Intel 8080 or Motorola 6800 up to current ISA's, like Intel/AMD, IBM POWER or ARM.

I only want to comment about the addressing modes, because many other ISA proposals, including RISC-V, do not seem to have any clue about how they are used in real programs.

There are only 2 possible choices for a set of addressing modes that would allow writing a loop with a minimum number of instructions.

The first possible choice coincides with the subset of the VAX 11 addressing modes implemented in Intel 80386, which, like in the Agner proposal, allows addresses with 3 components, a base register, a scaled index register and an offset included in the instruction.

This choice of addressing modes was probably the only feature of the Intel 80386 ISA that was better than in the earlier Motorola MC68020. Motorola has chosen to implement almost all the addressing modes of VAX 11, not only the subset chosen by Intel, but the addressing modes omitted by Intel were not really useful, so eventually even Motorola abandoned them in the ColdFire processors.

The second possible choice for the set of addressing modes appeared initially (around 1980) in one of the IBM RISC processors that were later developed into IBM POWER. This choice was also adopted by ARM, after it was published by IBM at one of the early RISC conferences.

This second choice is to allow addressing modes with only 2 components, a base register and an offset either in a register or in the instruction, but to allow updating the base register with the computed address.

I believe that the IBM choice is somewhat better, but both choices are acceptable. Any other set of addressing modes, e.g. that of RISC-V, is wrong, because in almost all loops it requires the insertion of extra instructions for updating the addressing registers.

Even if the hardware could execute the extra instructions in parallel, there would be a waste of resources anyway, because the extra instructions would occupy decoder slots and space in the instruction caches.

The IBM choice has the advantage that it does not require a second address adder and a shifter for scaling, but the disadvantage that it requires an extra write port into the register file.

From a software point of view, the IBM choice has the advantage that it is universal, i.e. it can be applied to any loop, while the Intel 80386 choice can be applied only to loops where the data structures have been chosen carefully. The reason is that, in order to avoid extra address-updating instructions, the scaled index register must be the loop counter, and this, together with the limited set of values that may scale the index, forces the ratios between the element sizes of the arrays accessed during the loop to belong to the set of scale values (1, 2, 4, or 8 for Intel/AMD).

Nevertheless, these constraints for data layout are acceptable in most cases.
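
The 80386-style constraint Adrian describes can be illustrated with a simple C loop. When the loop counter doubles as the scaled index, both memory accesses below fold into a single addressing-mode operand (the source with scale 2, the destination with scale 4), so the loop body needs no separate pointer-update instructions:

```c
#include <stdint.h>
#include <stddef.h>

/* Element-size ratio is 4/2 = 2, within the {1, 2, 4, 8} scale set,
   so one index register i serves both arrays:
   src element: [src + i*2],  dst element: [dst + i*4] */
void widen(int32_t *dst, const int16_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Under the IBM base-plus-update style, the same loop would instead advance two base registers by their respective element sizes on each iteration, which works for any element-size ratio.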

In order to evaluate which choice is cheaper from a hardware point of view, it is necessary to know exactly the technology used for implementation. If a second write port would be needed anyway for the register file due to other reasons, then the IBM choice would be certainly cheaper.

So, in conclusion, the set of addressing modes proposed by Agner is certainly much better than that of RISC-V.

I also completely agree with the use of a set of general registers instead of a dedicated flag register.

There should also be a complete set of instructions that would allow the writing of efficient programs for multiple precision computation, e.g. the GMP library.

Despite the ugliness of most of the legacy part of the Intel ISA, during the last 10 years Intel has continuously improved its support for multiple precision computation, leaving all the competition far behind.

All the RISC ISA's had traditionally bad support for multiple precision computation, even if that had nothing to do with the RISC principles. Even in the old days, when the need for encryption was not yet widely recognized, there were some users, like myself, who executed frequently that kind of instructions for scientific and technical applications.


Author: Adrian Bocaniciu Date: 2016-01-04 06:41
While I agree that having universal registers would be much better than having separate integer & FP registers, I doubt that it is good to have overlapped scalar & vector registers.

I think that it would be better to have 32 scalar registers and 32 vector registers. I do not think that this implies any significant changes in the instruction encoding that you have in mind.

This would certainly simplify the task of the operating system and of the interrupt routines to decide which registers should be saved.

This would certainly also make easier any extension to much longer vectors. I have seen several opinions on the Internet, and I agree with them, that many features of the ISA's used by Cray and by a few other vector processors were actually much more convenient to exploit in software than the current MMX/SSE/AVX style of vector instructions.

In conclusion, I believe that separate scalar & vector registers would be simpler to use, because the scalar registers, having a known length, can be saved in a predictable way, without examining any state registers. You could make a sophisticated save unit that would save only the non-null parts of the vector registers, but it would insert an unpredictable delay that would not be acceptable for interrupt routines.

Moreover, any program already has distinct scalar & array variables, so mapping them to scalar & vector registers is trivial.

Like I said, I also think that a reading of the old Cray manuals that are freely available, e.g. at bitsavers.informatik.uni-stuttgart.de/pdf/cray/
and the comparison of their vector instructions with AVX-512 to see what is best between themselves and how they can be extended to greater vector lengths, could be useful to improve your ISA proposal.

Author: Adrian Bocaniciu Date: 2016-01-04 09:20
I agree with most of the features of the proposed ABI.

Nevertheless, I believe that a modern ABI should take into account that C is no longer the only language for which the ABI should be adequate.

Some requirements of other languages can be easily accommodated, e.g. for languages that allow returning multiple values, they can be placed in multiple registers starting with r0, exactly like the input arguments, not only in r0, like the single return value of C.

A much more important requirement of other languages is to allow the efficient implementation of procedures with tail calls, e.g. with tail recursion.

For this, the stack must be deallocated in the called procedure, not in the caller.

The one and only reason for the existence of the so-called C calling convention, where the caller deallocates the stack, is that it was a lazy solution to the (former) existence of lazy C programmers who called vararg functions, e.g. printf, without including the appropriate header where the prototype was declared (or who included pre-standard headers, where vararg functions were not marked). The compiler could therefore never know whether an external function was vararg, so it had to assume that all of them were.

If such practices are prohibited, as they should be, then the right implementation of vararg functions is that the compiler must add an extra hidden parameter, e.g. the old value of the stack pointer, that would allow the called procedure to correctly deallocate the stack.

In that case the ABI should specify that the callee must deallocate the stack. There is absolutely no advantage to deferring the deallocation until after the return.

Besides allowing efficient tail calls, this ABI rule would also reduce the size of the code, because it replaces multiple deallocation instructions from the callers with a single instruction in the callee.
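
The tail-call benefit Adrian argues for can be seen in ordinary C. With a callee-pops convention, a compiler can turn the recursive call below into a plain jump that reuses the current stack frame, so the recursion runs in constant stack space (the transformation itself is the compiler's, not visible in the source):

```c
/* Tail-recursive sum of 1..n with an accumulator. The recursive call
   is in tail position: nothing happens after it returns, so the
   caller's frame can be reused when the callee cleans up its own
   arguments. */
long sum_to(long n, long acc) {
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);   /* tail call */
}
```

With caller-pops, the same call cannot simply become a jump when stack arguments are involved, because the deallocation code after the call would be skipped.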

I am aware that the defendants of the C calling convention typically claimed that its code size disadvantage is not so great, because the compiler may coalesce several deallocation instructions into only one, inside the caller (if it includes multiple procedure calls).

I do not agree with this claim. The main environment where C is still dominant, and where code size is also essential, is in programs for embedded computers. However, that is also the environment where the stack size is severely constrained and deferring stack deallocation, to reduce the code size, greatly increases the risk of stack overflow, so that is not an acceptable solution.

Author: Adrian Bocaniciu Date: 2016-01-04 10:11
I want to add something to my former post where I mentioned that the classic vector computers, e.g. Cray, could provide useful inspiration for a new ISA, besides the modern SIMD instructions, e.g. AVX-512.

I do not remember now who made this observation first, because I have read it somewhere, but I agree with it.

The main advantage of the classic vector computers is that they had a vector length register, which determined the size of the performed operation, and that length could have any value between 0 and the maximum length of the vector registers, i.e. it was not restricted to powers of two, like in the instruction field from your proposal or like in AVX, Neon etc.

This has the advantage that it simplifies considerably the code that must deal with the initial or final parts of the arrays whose sizes are not a multiple of the size of a vector register or which do not have a correct alignment.

For AVX-512 and longer vector registers it is likely that the code dealing with all the possible cases of lengths and alignments will be much larger than the code for the main loop, and it will increase in geometric progression for each new maximum vector length.

With a vector length register, this extra code will be much smaller and its size will remain unchanged when the size of the vector registers will be increased.
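
The strip-mining style Adrian describes can be sketched in C, with the vector-length register modelled by an ordinary variable. Because the length may take any value from 0 to the hardware maximum, the final partial block is handled by the same loop body and no separate scalar tail loop is needed (`MAXVL` is a stand-in for the hardware's maximum vector length, not part of any real ISA):

```c
#include <stddef.h>

#define MAXVL 16   /* stand-in for the maximum vector register length */

void scale(double *a, double s, size_t n) {
    for (size_t i = 0; i < n; ) {
        /* set the "vector length register" for this iteration:
           a full block of MAXVL elements, or whatever remains */
        size_t vl = (n - i < MAXVL) ? (n - i) : MAXVL;
        for (size_t j = 0; j < vl; j++)   /* one vector op of length vl */
            a[i + j] *= s;
        i += vl;
    }
}
```

The inner loop stands in for a single vector instruction; on a classic vector machine its length would come from the vector-length register rather than from code.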

Unlike the separate scalar & vector registers, which I believe to be mandatory from the perspective of the operating system & ISRs, and which is a feature with minimal influence upon your proposed instruction encoding, the use of one or more vector length registers (maybe the registers r8 to r16, like the use of r0 to r7 for predicate registers) might require more significant changes in your proposal for instruction encoding, e.g. the use of those 3 bits to specify a vector length register instead of a length, so you must assess if you believe that it is worth it.

Like I said this is not my proposal and I do not remember right now where I have read it, but I found its argumentation compelling.

Author: Adrian Bocaniciu Date: 2016-01-05 06:47
Just an additional explanation for my previous post.

I was not clear enough, but one possible modification of your encoding scheme for vectorial operations would be to keep 0 = scalar registers & 7 = vector registers of maximum length, but to have 1 to 6 specify a vector length register, e.g. r9 ... r14.

This would partially lose the advantage of your encoding scheme of allowing old programs to run unchanged on new processors with longer vector registers.

Nevertheless, that could still be done, by querying the maximum length of the vector registers with CPUID and acting accordingly. The ability to perform a vector operation with a specified vector length would make it easy to write generic programs that correctly process the final or initial part of the data arrays, regardless of the length of the implemented vector registers.


Author: John D. McCalpin Date: 2016-01-05 10:47
Re: vector length issues:

Various folks at Berkeley have been looking at flexible vector architectures for microprocessors. A search for Krste Asanovic and vectors should point you at much of the recent work.
Their proposals have long included reconfigurable vector registers -- allowing a block of SRAM to be divided into different configurations of vector length vs number of vectors.
More recent proposals would allow the elements of different vectors to be of different sizes.

Asanovic has argued for an ISA that would allow decoupling the vector width from the parallelism of the underlying hardware, so a single binary could have its vector instructions pipelined through whatever number of "vector pipelines" an implementation happened to provide. The presentation slides are at http://riscv.org/workshop-jun2015/riscv ... ne2015.pdf

I like the ideas, but have not looked at them in enough detail to form an opinion about their practicality.

I can say that I am getting very tired of trying to work around the limitations of existing SIMD vector ISAs. They are great when everything is lined up, but in that case you are almost always bandwidth-limited so the extra functional units don't help. They are a real pain when the data in the registers needs to be rearranged, which is the most common way that physics-based codes generate computational intensity.


Author: Adrian Bocaniciu Date: 2016-01-06 07:12
Yes, you are right, this presentation from Berkeley by Krste Asanovic about the need for the resurrection of the vector computers was what I had in mind.

While most of the features of the ISA proposed by Agner would be significant improvements over RISC-V, the inclusion of conventional packed SIMD instructions would be much less desirable than the kind of vector ISA extension proposed for RISC-V.

I am currently using the hardware that during 2015 had the best performance-per-watt and the best performance-per-dollar for double-precision computations (Xeon D-1540 + FirePro W8100), so I have no complaints about the peak performance achievable.

Nevertheless, approaching that peak performance requires a lot of annoying optimizations and special case processing imposed by the programming for packed SIMD + GPU, so a decent vector ISA would be a clear improvement. Thus I completely agree with Krste Asanovic.


Author: Ook Date: 2016-01-05 17:00
Adrian Bocaniciu wrote:
This has the advantage that it simplifies considerably the code that must deal with the initial or final parts of the arrays whose sizes are not a multiple of the size of a vector register or which do not have a correct alignment.

For AVX-512 and longer vector registers it is likely that the code dealing with all the possible cases of lengths and alignments will be much larger than the code for the main loop, and it will increase in geometric progression for each new maximum vector length.

With a vector length register, this extra code will be much smaller and its size will remain unchanged when the size of the vector registers will be increased.


AVX-512 also adds masking, which is supposed to keep the prolog/epilog overhead small.
I have no hard data on whether this works out in practice.
Another approach to getting rid of this overhead is the software pipelining variant described by Ivan Godard in one of his Mill talks.
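
As a rough illustration of how masking removes the separate remainder loop, here is a scalar simulation of the idea (a sketch only; `kLanes`, `tail_mask`, and `masked_add` are invented stand-ins, and real AVX-512 would use k-registers and masked intrinsics instead):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable sketch of AVX-512-style masked remainder handling. A mask with
// the low `rem` bits set lets the final iteration of a vectorized loop touch
// only the valid elements, so no separate scalar epilog loop is needed.
constexpr std::size_t kLanes = 8;   // assumed vector width in elements

inline std::uint8_t tail_mask(std::size_t rem) {
    // rem == kLanes gives an all-ones mask; rem < kLanes masks off the tail.
    return static_cast<std::uint8_t>((1u << rem) - 1u);
}

// Masked add: dst[i] += src[i] only where mask bit i is set.
inline void masked_add(float* dst, const float* src, std::uint8_t mask) {
    for (std::size_t i = 0; i < kLanes; ++i)
        if (mask & (1u << i)) dst[i] += src[i];
}

void add_arrays(float* dst, const float* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes)          // main loop, full vectors
        masked_add(dst + i, src + i, tail_mask(kLanes));
    if (std::size_t rem = n - i)                  // single masked epilog
        masked_add(dst + i, src + i, tail_mask(rem));
}
```

The point Ook raises is whether the masked epilog is actually cheap in hardware; the code size argument, at least, clearly favors it over per-case remainder code.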


Author: acppcoder Date: 2016-03-27 00:39
The rationale behind the RISC-V instruction set:
"There are no predicated or masked instructions"
This is to simplify out-of-order and superscalar designs.
It is also why there are no condition codes; branches perform their own comparison (BLT src1, src2, dst, etc.).

The dependencies of an instruction are all expressed directly in fixed locations (registers), allowing them to be reasoned about very early in the pipeline, without having to track additional state in the reorder buffer. Predication, CMOV, etc. introduce additional dependencies.

(I was disappointed that there is no 'select', but given everything else it's OK; the proposed vector extension handles the cases where you would want one.)
"Arithmetic instructions cannot have memory source operands?"
"Immediate constants have odd sizes. It is not possible to include floating point immediates, which I argue would be more efficient than loading floating point constants from data memory"
"128-bit integers are not supported, except as pointers in 128-bit address mode"
These are all consequences of the simple RISC idea: a predominantly load-store design with fixed-length 32-bit instructions, which simplifies pipelining because the instruction decoder is trivial. (There is a compressed 16-bit instruction capability, but it is optional.)
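
To illustrate how trivial fixed-length decode is, here is a sketch of extracting the R-type fields at their fixed bit positions (field layout as in the published RISC-V base spec; the struct and function names are my own):

```cpp
#include <cassert>
#include <cstdint>

// Field extraction for a RISC-V R-type instruction word: fixed positions
// mean decode is a handful of shifts and masks, with no instruction-length
// determination needed first.
struct RType {
    std::uint32_t opcode, rd, funct3, rs1, rs2, funct7;
};

inline RType decode_rtype(std::uint32_t insn) {
    return {
        insn & 0x7Fu,          // opcode  [6:0]
        (insn >> 7)  & 0x1Fu,  // rd      [11:7]
        (insn >> 12) & 0x07u,  // funct3  [14:12]
        (insn >> 15) & 0x1Fu,  // rs1     [19:15]
        (insn >> 20) & 0x1Fu,  // rs2     [24:20]
        (insn >> 25) & 0x7Fu,  // funct7  [31:25]
    };
}
```

For example, 0x002081B3 is the canonical encoding of `add x3, x1, x2`: opcode 0x33, rd = 3, rs1 = 1, rs2 = 2.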

The classic RISC design makes it possible to produce a very simple implementation. RISC-V is designed to scale from (i) very small embedded cores to (ii) large high-performance cores, or (iii) high-throughput (manycore) accelerators (the latter two being different cases). The RISC instruction set principles have proven to work in all of these contexts.

The choice of a pure RISC is justified as follows: in the past few decades, CISC has only ever been pursued for backwards compatibility with x86. Most attempts to produce something new from a clean slate have adopted RISC ideals, and the only other processors that became popular were RISCs.
Going all the way back to the simple case: I recall that ARM's first implementation was a 28,000-transistor pipelined 32-bit design, while the contemporary 68000 was literally 68,000 transistors for a non-pipelined 16-bit version; RISC has proven it can scale all the way down. Even today, with huge transistor counts, there is the possibility of building a manycore dataflow processor; Adapteva are probably pursuing this, given their previous product and their stated intention to use RISC-V next. The idea is to cram as many cores as possible, each with local memory, onto a die. Another important use case is as the basis for accelerators: you just need *something* simple and complete to handle basic computation as a foundation for a custom unit revolving around some extension.
"Support for vectors is not well developed. Vector size is limited to 1024 bits"
"Software has to be recompiled each time a different processor with different maximum vector size becomes available"
"There is no support for integer vectors, Boolean vectors, masked vector operations, broadcast, etc."
Take a look at the Hwacha vector unit: it definitely decouples vector size from the ISA, and it does have predication and so on. But it is deliberately not part of the basic standard; I think they are still finalising it. A high-throughput accelerator can always be built by going manycore with attached vector units.

I don't think broadcast suits them, because the design keeps the vector 'lanes' completely independent in order to hide latencies: it is more comparable to GPGPU (and the old Cray vector machines) than to x86-style SIMD.

Still, after all that, maybe someone else will eventually make a contrasting 'classic' 4-element SIMD unit extension that might suit 3D maths better (with permutes for accessing x/y/z/w, a dot product across the lanes, etc.). But from what I can gather, the Hwacha unit is easier to compile general-purpose code for.
"long int can be 32 or 64 bits. There is no standardized way of specifying 64-bit and 128-bit integers. This inconsistency is causing annoying compatibility problems today which need to be fixed in any new ABI."
I think they have an unusual intention to scale to 128-bit for future data centres, although I don't know what they would do about that.

All in all I'm a fan of the RISC-V decisions. The whole thing reminds me of MIPS which was a great ISA.

I do like the idea of a unified register file; however, at some point, when creating a standard, you have to draw a line. The advantages of separate files are in instruction length (each register field needs only 5 bits rather than 6), and of course implementations could physically move each register file closer to its respective execution units.


Author: Jake Stine Date: 2016-01-11 17:48
On Scalar vs. SIMD register banks:

I think that it is actually better to have separate register banks for scalar vs. SIMD registers. There is strong evidence that the underlying hardware is better off when it manages scalar/SIMD register banks independently, and most newer ISAs have been designed with this goal in mind; if nothing else, it greatly reduces the effort needed to avoid false dependencies during scalar operations. Unless a unified register bank is seen to make things easier for the programmer, there seems little motive to prefer it. From a compiler author's standpoint there is no advantage to a unified register bank compared to separate scalar/vector banks. In some ways separate register banks make things easier (less dependency tracking and less clever register resource guessing required), though a unified bank can make an ABI simpler.

Secondly, separate register banks allow for more total registers and/or shorter instructions: 32 scalar + 32 SIMD registers = 64 total, without needing the extra bit per register field that encoding 0-63 would require.

In practice it is the exception when SIMD algorithms need to load values to/from scalar registers. The most common scenario is a 32-bit integer broadcast, and typically these can be setup outside of a loop. The most common reason to load values from SIMD to scalar is to obtain a mask result, which is then compared to some bitmask on the ALU. Having a "mask-move + imm32 test" integrated directly into the SIMD unit would avoid that inter-bank move.
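
For reference, the mask-move Jake describes amounts to gathering the lane sign bits into one integer, as in this scalar sketch (SSE's movmskps does this in a single instruction; the function here is a portable stand-in with an invented name):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar sketch of a "mask move": collect the sign bit of each float lane
// into an integer bitmask, so the result of a SIMD compare (which sets lanes
// to all-ones or all-zeros) can be tested with ordinary ALU instructions.
inline unsigned movemask4(float a, float b, float c, float d) {
    const float v[4] = {a, b, c, d};
    unsigned m = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &v[i], sizeof bits);  // type-pun via memcpy
        m |= (bits >> 31) << i;                  // lane i sign bit -> bit i
    }
    return m;
}
```

Jake's suggestion is that the "compare mask against imm32" step could live in the SIMD unit itself, so this value never has to cross to the scalar bank at all.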

Regarding 512+ Bit SIMD:

I see no value in trying to design a CPU ISA that can support 512 bits; I feel as though it is a waste of CPU ISA design resources. Any algorithm that can go that wide is almost certainly better suited to a GPU ISA, where workgroups of 32 or 64 SIMD threads are fired off at once (with each thread being 128- or 256-bit SIMD). There is no foreseeable future in which a GPU ISA is not available on a device that has a high-end SIMD-capable (256+ bit) CPU. The vast majority of data being processed is sets of 2 or 4 single- or double-precision values. In all but the most obscure situations, it makes more sense to think in terms of many threads of 128-bit or 256-bit data rather than trying to scale a single thread up to 512 bits. This is easier for the compiler, hardware, and programmer, and it's a no-contest in terms of performance. Does anyone really think that 512-bit AVX will come anywhere close to the SIMD throughput achieved by even modest integrated GPUs? I recommend keeping the CPU ISA simpler and instead focusing some effort on tearing down the remaining barriers to utilizing the GPU for very wide SIMD. Or using an FPGA.

... of course Intel would never agree to such a design, since "more bits AVX++!" is their best bet to market new CPUs still. -_-

-Jake

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal for an ideal extensible instruction set

Post by agner » Thu Nov 02, 2017 8:37 am

Author: Agner Date: 2016-01-12 07:18
Thank you for all the useful comments, ideas and references. This gives me reason to discuss several possible improvements:

Vector length register

The idea of using an extra register to specify the length of a vector is excellent. This makes it possible to change the length of a vector at run time and to use a length that is not a power of 2. We can still have a 3-bit vector length field in the instruction code, but the meaning will be as follows. A value of 0 means a scalar. A value of 1 - 7 means that the length is specified by one of the registers r9 - r15. It is the responsibility of the compiler (or assembly programmer) to make sure that the specified length does not exceed the maximum length supported by the actual CPU. This will make the hardware a little more complicated, but it will be a great advantage to the software.

The common system with fixed vector sizes has a big problem when vectorizing a loop through an array. Often, the programmer or compiler does not know in advance whether the array size is a multiple of the vector size or not. Therefore, it is necessary to make extra "remainder" code in the end to handle any remaining array elements that don't fit into the vector registers. Another big problem is that there must be different versions of the code for microprocessors with different maximum vector sizes if you want optimal performance.

A system that allows variable vector sizes can use the maximum vector size supported by the microprocessor for all but the last iteration of the loop and then use a smaller vector size for the last iteration if necessary. The performance of such a loop can be further improved if we make a special instruction that finds the vector size as the smallest of the two values: the remaining number of array elements and the maximum vector size. This is easier to implement in software than masking off the rest of the vector.
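
The loop structure this enables can be sketched in scalar code as follows; the hypothetical `set_len` stands in for the proposed instruction that returns min(remaining elements, maximum vector length), and a scalar inner loop simulates one vector operation:

```cpp
#include <cassert>
#include <cstddef>
#include <algorithm>

// Sketch of strip-mining with a vector length register. The hardware
// maximum is assumed here; real code would query or ignore it entirely,
// which is the point of the proposal.
constexpr std::size_t kMaxVlen = 16;  // assumed hardware maximum

inline std::size_t set_len(std::size_t remaining) {
    return std::min(remaining, kMaxVlen);
}

void scale(float* a, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ) {
        std::size_t vl = set_len(n - i);      // last pass gets a short vector
        for (std::size_t j = 0; j < vl; ++j)  // one "vector" operation
            a[i + j] *= s;
        i += vl;                              // no separate remainder code
    }
}
```

Note that the same binary runs unchanged on a machine with any `kMaxVlen`, which is exactly what fixed-size SIMD code cannot do.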

It will also be possible to make library functions that have vectors of variable length as parameters and result.

It will not be too complicated to implement a variable vector size in hardware. The hardware can simply mask off the unused part of the vector register. There may be a problem with power consumption. A processor can save power when handling a vector of less-than-maximum size by either clock gating the unused part or by turning off power to the unused circuits. If the vector size is specified by an extra register then the information about the actual size of the vector will be available at a rather late stage in the pipeline and the CPU will have less time to adjust its power consumption.

Another problem is the extra dependencies. An instruction can have up to three input operands, a mask register and a vector length register. This gives a total of five input dependencies. The out-of-order scheduling system becomes more complicated if it has to handle this many dependencies.

Nevertheless, I am sure that the gains in software efficiency more than outweigh the hardware costs of supporting a variable vector length.

Alignment of vectors

It is easier for the hardware to write and read vectors to/from memory if the vector size is a power of 2 and the memory operand is aligned to an address that is divisible by the size of the vector. Various systems have different requirements for the alignment of vectors in memory. The requirements for alignment makes the software more complicated. The stack is always the preferred place to store local data in a function. There are two common ways of aligning data on a stack to a value higher than the stack word size:
  • Keep the stack aligned by the required size and propagate this alignment through the chain of function calls. A function must insert unused space on the stack before calling another function, if needed, in order to keep the stack aligned. The problems with this method are: it wastes stack space; it requires extra instructions in functions even when the alignment is not actually used; it does not take into account future systems with bigger vectors requiring higher alignment; and it may fail if different parts of the code have been compiled for different alignments.
  • Adjust the stack frame to the required alignment inside any function that needs aligned storage. This requires more instructions than method 1, but only when alignment is actually needed. A further disadvantage is that it uses an extra register for addressing the aligned stack frame or for saving the stack pointer.
Both methods are in use. The current x86-64 systems use method 1 for 16-byte alignment and method 2 for 32-byte alignment. Method 2 is probably preferable in a system with plenty of registers.
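
The core of method 2 is just rounding an address down to a multiple of the alignment, the same arithmetic that an and-with-a-negative-power-of-two in a function prolog performs. A minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Round an address down to the next multiple of `align` (a power of two).
// This is what an instruction like `and sp, -32` does to the stack pointer
// in a method-2 prolog; the original pointer must be saved elsewhere so the
// epilog can restore it.
inline std::uintptr_t align_down(std::uintptr_t p, std::uintptr_t align) {
    // ~(align - 1) is an all-ones mask above the low log2(align) bits
    return p & ~(align - 1);
}
```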

A third possibility is to modify the hardware so that no alignment is required. Current state-of-the-art microprocessors have no performance penalty for misaligned memory operands, except when a cache line boundary is crossed. I am not in a position to weigh the hardware costs of handling unaligned memory operands against the software costs of aligning data, but this would be the optimal solution if the hardware costs are not too high.

Whatever solution we end up with, there are certain things that should preferably be coordinated: The alignment of memory operands, the stack alignment, the minimum vector size that must be supported, and the required alignment for arrays. If the hardware can support vectors aligned by 8 bytes efficiently, then we may decide the rule that vectors in memory must be aligned by 8 and that all arrays containing at least 8 bytes must be aligned by 8 so that they can be handled in vector registers.

Should vectors and scalars use the same register set?

Two commentators have argued that it is better to have separate registers for scalars and vectors, rather than just one register set as I first proposed. The scalar registers can be saved in a way that is sure to be compatible with future extensions because their size is fixed. This will make it possible to have rules for callee-save registers, i.e. registers that a function must save and restore if they are used. For example, we may decide that scalar registers r24-r31 have callee-save status. Such registers are useful for saving data across a function call. Vector registers cannot have callee-save status because it is difficult to save a vector register in a way that is compatible with future extensions of the vector size. Instead, we will use information stored in object files and static libraries in order to know which vector registers are modified by a function as explained in my first post.

While we may split the registers into vector registers and scalar registers, I don't want a further split into integer and floating point registers. In the case of vectors, there are many instructions that can be applied equally well to integer and floating point data, such as read, write, move, broadcast, blend, permute, gather and scatter. Masks generated with integer instructions can be applied to floating point data. Integer instructions can be used for manipulating floating point values, e.g. manipulating the sign bit, manipulating NAN and INF values, splitting a floating point value into exponent and mantissa, etc.
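
Two of the integer-on-float operations mentioned here can be sketched on the IEEE-754 single-precision bit pattern (portable C++ using memcpy for the type pun; on the proposed hardware these would simply be integer instructions applied to a shared register):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Flip the sign of a float with an integer XOR on its bit pattern.
inline float flip_sign(float x) {
    std::uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    b ^= 0x80000000u;                 // sign is the top bit
    std::memcpy(&x, &b, sizeof x);
    return x;
}

// Split out the biased exponent field with an integer shift and mask.
inline int exponent_bits(float x) {
    std::uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    return (b >> 23) & 0xFF;          // 8-bit biased exponent
}
```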

The same instructions can still be used for both scalars and vectors. A zero in the vector length field of an instruction can indicate that a scalar register is used instead of a vector register. A vector register can still be used for scalar operations if the vector length register contains the value 1. If the same instruction can be used with vector registers and scalar registers, then it follows that the scalar registers can handle both integer and floating point values, just as the vector registers. The only exception is that 128-bit integers and quadruple precision floating point numbers, if supported, can only be handled in vector registers.

The predicate/mask field can indicate either a predicate in a scalar register or a mask in a vector register, depending on whether the vector length field indicates a scalar or a vector. The vector length register is always a scalar register. Instructions that cannot be used with vectors do not need a vector length field.

Function calling conventions

The function calling convention and the order of parameters on the stack needs further discussion. First, we need to decide if the stack should grow downwards, as is common today, or upwards, which would be more logical. The tradition of making the stack grow downwards originates from non-protected operating systems where global data and heap grow up from the bottom of the RAM space while the stack grows downwards from the top until the two meet and the memory is full. This is less relevant in modern systems with virtual memory addresses and multiple threads with each their own stack. I have no compelling reason to prefer the stack to grow upwards or downwards, so let's defer this question. For now, I will assume that the stack grows downwards.

I have proposed that parameters on the stack should be stored in C order. This means that the first parameters are stored at the lowest addresses on a downwards-growing stack. If parameters are stored with push operations, then the last parameter must be pushed first when C order is used. Today, it is common to allocate stack space for the parameters first, and then store the parameters into this stack space, rather than using push instructions. The opposite order, which I will call Pascal order for lack of a better term, has the first parameter pushed first. Pascal order is more logical for assembly programmers if push instructions are used, while C order is more logical if the stack frame method is used because it has the first parameters at the lowest address.

The C parameter order was invented in order to facilitate function calls where the number of parameters was not specified, and in particular for functions with varargs, i.e. a variable number of parameters. I want to propose a completely different solution for varargs. Instead of putting varargs parameters on the stack or in registers, I propose to put them in a list which can be stored anywhere in memory, and only transfer a pointer to this list as a parameter. This has several advantages: No stack space is needed if the pointer can be transferred in a register. The length of the parameter list can be modified at run time. The parameter list can be reused for multiple function calls. And a function can easily forward its parameter list to another function.
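
A sketch of what this varargs convention might look like at the language level, with an invented `ArgList` structure standing in for the in-memory parameter list (layout and names are illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// The caller builds a list in memory and passes a single pointer, instead of
// pushing a variable number of arguments on the stack.
struct ArgList {
    const double* args;   // the variable arguments
    std::size_t   count;  // length known, and changeable, at run time
};

double sum_all(const ArgList* list) {
    double s = 0.0;
    for (std::size_t i = 0; i < list->count; ++i) s += list->args[i];
    return s;
}

// Forwarding the whole argument list to another function costs nothing:
// just pass the same pointer on.
double average(const ArgList* list) {
    return list->count ? sum_all(list) / list->count : 0.0;
}
```

This captures the advantages listed above: no stack space if the pointer travels in a register, a run-time-variable length, reuse across calls, and trivial forwarding.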

If we allow the use of the first 24 scalar registers and the first 24 vector registers for function parameters, then we can have a function call with up to 48 parameters without saving any parameters on the stack. Varargs lists will not use the stack either if a pointer to a list is used. Now the order of parameters on the stack certainly becomes less important. Nobody will use assembly language to call a function with so many parameters, and few compilers will use push instructions, so the argument about the order of push instructions is irrelevant. However, there are still two arguments for using the C order: If the stack grows downwards then the C order will have the first parameter at the lowest address, which is more logical for both caller and callee. The second argument is that the first parameter will be closest to the address pointed to by the stack pointer. If the caller and callee disagree on the number of parameters because they are relying on different versions of a function prototype, or whatever, then the C order will make errors in the last parameters only, while the Pascal order will have errors in all the stack parameters. The C order will therefore make it easier to locate such an error.

Another issue is whether the caller or the callee should clear the stack. I will argue that it is safer to let the caller clear the stack. If the callee clears the stack and the caller and callee happen to disagree on the amount of stack space used, then the stack will be corrupted and the system is likely to crash. But the system is not guaranteed to crash in this case. All kinds of unpredictable and disastrous things can happen. It is possible, for example, that a function pointer has been stored on a stack space where the caller expects to find a saved return address after the stack has been corrupted. This will cause it to continue in the pointed-to function and the result will be unpredictable.

Efficiency is not an issue here because the stack clearing rule is relevant only when calling a function with at least 25 parameters, and possibly 49. This is a very rare situation anyway. The adjustment of the stack pointer requires only a single instruction, which is likely to execute out of order. The impact on performance is zero or at least insignificant for such a large function.

The programming language is irrelevant here. These arguments apply to all programming languages. The efficiency of tail calls is not affected by who clears the stack. A tail call will save one stack clearing in both cases.

We need a rule for parameters that do not fit into any type of register. I will propose that such parameters should be stored in memory and a pointer to the parameter should be transferred as a parameter. The function is allowed to modify the data that the pointer points to, unless the parameter is declared const. It is the responsibility of the caller to call any copy constructor or destructor of the parameter. It is an open issue whether a function should be allowed to modify a parameter in a varargs list.

Link register

Some systems store the return address of a function on the stack, while other systems use a link register to hold the return address. The advantage of a link register is that a leaf function can be called without storing anything on the stack. This saves cache bandwidth in programs with many leaf function calls. The disadvantage is that every non-leaf function needs to save the link register on the stack before calling another function and restore the link register before returning.

RISC-V specifies that the link register is one of the general purpose registers. I will argue that if a link register is used then it should be a special register. A link register does not need to support all the things that a general purpose register can do. If the link register is included as a general purpose register then it will be tempting for a programmer to save it to another general purpose register rather than the stack, and then end the function by jumping to that general purpose register. This will work, of course, but it will interfere with the way returns are predicted. Modern microprocessors use a special mechanism for predicting returns, which is different from the mechanism used for predicting other jumps and branches. This mechanism, called a return stack buffer, is a small rolling cache that remembers the addresses of the last calls. If a function returns by jumping to a register other than the link register, then it will use the wrong prediction mechanism, and this will cause severe delays due to misprediction of the subsequent series of returns.

The only instructions that are needed for the link register other than call and return, are push and pop. We can reduce the number of instructions in non-leaf functions by making a combined instruction for "push link register and then call a function" which can be used for the first function call in a non-leaf function, and another instruction for "pop link register and then return" to end a non-leaf function.

Tail calls will be equally efficient with and without a link register.

High level language support for vectors

C/C++ compilers often have support for vector registers. This includes system-specific types, e.g. __m256 for a 256-bit vector of 8 single-precision floats. If we want high-level language support for a system with variable length vector registers then we must have a system-specific addition to the programming language that defines such vectors, e.g. __vector<float,8> for a vector register of 8 floats. These types can be used as function parameters and function returns. The definition of a function with a vector parameter of a certain size implies that the CPU must support vectors of the specified size. Therefore, it is possible to transfer such a vector parameter in a single register and return such a vector in a single register regardless of its size.

It will also be useful to have a way of specifying vectors of variable size in a high-level language where the size is specified in a separate parameter.
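
A minimal sketch of how such a `__vector<float,8>`-style type could look as an ordinary C++ class template (a real compiler extension would map it onto a single vector register rather than a struct, which is what makes single-register parameter passing possible):

```cpp
#include <cassert>
#include <cstddef>

// Fixed-size vector type: the size is part of the type, so a function
// signature using it implies a minimum hardware vector size, as described
// in the post above.
template <typename T, std::size_t N>
struct Vec {
    T lane[N];
    friend Vec operator+(Vec a, const Vec& b) {
        for (std::size_t i = 0; i < N; ++i) a.lane[i] += b.lane[i];
        return a;
    }
};

// Passing and returning by value models "one register in, one register out".
inline Vec<float, 4> add4(Vec<float, 4> x, Vec<float, 4> y) {
    return x + y;
}
```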

Software pipelining

Support for software pipelining was proposed. This requires a rolling bank of registers that can be allocated to a loop. Software pipelining can improve the performance of complex loops in cases where out-of-order scheduling would otherwise be a bottleneck. Support for software pipelining would be a complicated addition to the architecture, and it would be irrelevant for the quite common cases where performance is limited by memory bandwidth, cache performance, etc. Some of the mechanisms are patented, which could be an obstacle to including it in an open standard.

I think that we should allow experimentation in this area and be prepared for a future extension of the standard with optional support for software pipelining.

Further discussion topics

RISC-V proposes a 128-bit addressing mode. I don't understand what such a huge address space can be used for. 64-bit addressing gives us more than 10^19 bytes of address space. This is more than we will need in any computer in the foreseeable future. Some have argued that 128-bit addresses can be useful in CPU clusters and clouds. But I don't think that a CPU at one node in a cluster should be allowed to directly address a RAM cell at a different node. 128-bit addresses would mean that every entry on the stack would use 16 bytes, most of which would be zero. This would be a waste of precious cache space. The system would certainly be simpler if the only allowed addressing mode is 64 bits.

McCalpin proposes a reconfigurable register space. This would add a lot of complication to the instruction set as well as to the hardware. I suspect that the hardware necessary for supporting register reconfiguration would take up more silicon space than simply allowing all the registers to have full size.


Author: Jonathan Morton Date: 2016-02-02 17:19
At the moment, the biggest red flag I can see in this proposal is the use of "doubleword" to refer to 32 bits. Everywhere outside the Wintel enclave - including the ubiquitous IEEE-754 standard - "doubleword" means 64 bits, and 16 bits is referred to as "halfword". It's a Freudian slip which betrays a certain cognitive bias towards the same obsolete 16-bit architecture you're proposing to replace. Not what one wants to see when trying to look forward.

I'm also somewhat perplexed by your simultaneous assertion that a fixed-size RISC instruction word causes low code density, and the specification of a *minimum* instruction word size that is the same as a typical RISC instruction word size. The only efficiency you gain is the ability to use large immediate operands in a single instruction, but modern RISC ISAs (Alpha, AArch64, PowerPC) can already build an arbitrary 32-bit immediate in 2 instructions (64 bits), limiting your code density advantage to immediate operands larger than 32 bits. These are not so common.

I'll admit that combined load-arithmetic instructions can improve code density, but that comes at the expense of a more complicated front-end in hardware (or, for traditional in-order CISC, a more complicated pipeline) and has absolutely nothing to do with instruction word size. You do also gain a certain future-proofing flexibility by allowing longer instruction formats, but that has absolutely nothing to do with code density.

With that said, I have a different proposal for handling vectors, which I think is closer to the original Cray model. In this model, there are no architectural "vector registers", only scalars and "pipeline slots". Conceptually, the machine appears to repeat instructions a given number of times on successive data elements, but without an explicit branch instruction, similarly to the x86 "string" instructions.

The cleanest way to specify this I can think of is, oddly, similar to the x87 stack model. Now, x87 was a horrible model for high-performance arithmetic, because each instruction could do only one operation, it was impossible to specify software pipelining (executing more than one complex expression in parallel), and it was therefore hard to extract ILP at runtime. But it did allow specifying a single expression compactly and without explicit reference to register names. Substitute Forth as a mental model if you prefer.

Vector instructions would thus act as if on scalar values, with explicit load, store and pointer-update operations, referring to this stack for their virtual input and output operands. Instead of executing immediately, they would be stored in a buffer, and decoded into a pipeline of operations, with the expectation of operating this entire pipeline in multiple-parallel at maximum performance. The pipeline would implicitly be complete when the operand stack became empty; attempting to execute a pipeline with an imbalanced stack would be a trap error.

The complete pipeline would then be executed by loading the initial values into the relevant scalar registers (which were specified using "input" instructions), followed by a count value into a special-purpose register. The normal instruction flow would also continue, but a pipeline-wait instruction would prove useful.

Interrupts, including page faults, would not inherently disrupt this pipeline building or execution process, and would be able to use the scalar registers independently. It would be necessary to halt, save, restore and resume the pipeline state (whether empty, in the process of being built, complete, or executing) for context switching and page-fault handling.

The great advantage of this system is that the program need be aware of neither the number of operations the CPU can perform in parallel (which was an inherent flaw in Itanium, as well as with block-SIMD), nor the alignment requirements of the memory system (bar those of the individual data elements), without even needing to query them at runtime. An austere implementation could be entirely serial, operating like a standard for-loop over an x87, within a physical register set barely larger than that required to support the architectural scalar registers. A high-performance implementation might, in extreme cases, farm the pipeline out to something like a GPU - an idea which would certainly prick AMD's ears up.
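
A toy serial interpretation of this pipeline model, with all names invented for illustration: vector operations are recorded RPN-style into a buffer, and the "austere implementation" then runs the whole pipeline once per element, like the standard for-loop over an x87 described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ops refer to a small virtual operand stack, x87/Forth style, instead of
// named vector registers. A high-performance implementation would pipeline
// these across many lanes; this sketch executes them serially.
enum class Op { LoadA, LoadB, Add, Mul, Store };

struct Pipeline {
    std::vector<Op> ops;  // the recorded pipeline, built before execution
    void run(const double* a, const double* b, double* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {   // one pass per element
            double stack[8]; int sp = 0;
            for (Op op : ops) switch (op) {
                case Op::LoadA: stack[sp++] = a[i]; break;
                case Op::LoadB: stack[sp++] = b[i]; break;
                case Op::Add:   --sp; stack[sp - 1] += stack[sp]; break;
                case Op::Mul:   --sp; stack[sp - 1] *= stack[sp]; break;
                case Op::Store: out[i] = stack[--sp]; break;
            }
        }
    }
};
```

The program never states how many elements are processed per cycle, which is the property Jonathan is after: the same recorded pipeline works on a serial core or a wide one.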


Author: Agner Date: 2016-02-03 01:36
Thanks for your input.

I have no problem with using "word" to mean 32 bits and "halfword" for 16 bits.

Regarding code density. The problems with 16-bit instructions are many. You can't have 3-register instructions. You don't have space for specifying operand size and type, vector size, predicate or mask, rounding mode, exception handling, and all the other features that may be needed in the future. And quite importantly: memory address offsets and immediate constants have odd sizes in a 16-bit coding scheme. This causes problems in linkers and loaders when the offsets overflow, and it causes problems for the high-level language programmer who may not know whether a constant will fit into an instruction.

With a 16-bit minimum instruction word size, you will waste more bits on specifying instruction size, and instruction decoding will be a bottleneck like it is in x86. The nice thing about my proposal of 32-bit instruction words is that it allows a completely orthogonal instruction set. Any instruction can be specified with a register operand, a memory operand or an immediate operand - all with the same size, so that you can be certain that any value will fit into any of these, while small values can still fit into smaller instructions to save code cache space.

If code density is important, then I can suggest a compromise. Allow two tiny instructions to fit into a 32-bit code word. The first 4 bits of the 32-bit word indicate that this is a double instruction, followed by two tiny instructions of 14 bits each. These tiny instructions obviously don't need any bits for specifying instruction size. They can be used for the most common simple instructions with one or two registers. A disadvantage is that you cannot jump to the second instruction of such a pair of tiny instructions. All jump offsets are still scaled by the standard instruction word size of 4 bytes.
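The packing scheme described here can be sketched in C++. The prefix value and field layout below are assumptions chosen for illustration; the proposal only specifies "4 bits of prefix, then two 14-bit tiny instructions":

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding (illustration only): if the top 4 bits of a 32-bit
// code word equal a reserved prefix, the remaining 28 bits hold two 14-bit
// tiny instructions; otherwise the word is a normal 32-bit instruction.
constexpr uint32_t PAIR_PREFIX = 0xF;   // assumed reserved prefix value

constexpr uint32_t pack_pair(uint32_t tiny1, uint32_t tiny2) {
    // tiny1 and tiny2 must each fit in 14 bits
    return (PAIR_PREFIX << 28) | ((tiny1 & 0x3FFF) << 14) | (tiny2 & 0x3FFF);
}

constexpr bool     is_pair(uint32_t word)     { return (word >> 28) == PAIR_PREFIX; }
constexpr uint32_t first_tiny(uint32_t word)  { return (word >> 14) & 0x3FFF; }
constexpr uint32_t second_tiny(uint32_t word) { return word & 0x3FFF; }
```

Since jump offsets are still scaled by 4 bytes, only `first_tiny` can ever be a jump target, which is the disadvantage mentioned above.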

Regarding your proposal of pipelined "string" instructions. They will have to either include memory addresses and work on the level-1 cache or use a register stack of fixed size. If your code has multiple accumulators or vectors then you need multiple register stacks. This sounds quite complicated to me. I am not sure I understand your idea.


Author: Jonathan Morton Date: 2016-02-12 06:25

Orthogonality is a good argument - I just thought it odd to discuss code density so much, and then do very little to improve it over, say, PowerPC. Whatever faults ARM may have, it does have far better code density than PowerPC - *even if* you ignore the 16-bit compressed format it supports. Some of this has to come from its support of, for example, a conditional-shifted-add in one instruction, where PowerPC must use three: either conditional-branch, shift, add; or shift, add, select.

The pipelined-string idea is a bit new and strange, that's true. I'll try to explain it a bit more.

There is only one stack, and it is only used at pipeline-setup time. It's really a stack of pointers into the vector register bank, which exists only as rename registers and is not directly accessible (which is why user programs won't need to know how big it is). The pipeline-setup engine essentially converts the pipeline instructions into an SSA basic block - in hardware - and stores the resulting uops (or whatever) in an internal buffer. Because this is only done at setup time, it can be relatively slow - one to three cycles per instruction, say - with the goal of maximum execution speed subsequently. The stack can thus be as large as can be justified, and will certainly be smaller than the actual register bank.

Memory addresses are collected from the scalar register bank at pipeline-execution time, and auto-increment versions of memory access instructions would be provided in the pipeline, inherently providing unambiguous prefetch hints. The same can be done with scalar arguments to the vector operation. If the updated version of auto-incremented scalars/addresses is defined to be written back to the scalar register bank after pipeline execution is complete, this might simplify restarting the pipeline after an interruption. If intermediate values in the pipeline can *also* be used as memory addresses, this would facilitate fast scatter/gather operations, which is a major weak spot of present SIMD architectures.

Complex pipelines will often need to use common-subexpressions and so on. The standard way to deal with this in a stack architecture is "duplicate" and "exchange" instructions, which should be familiar from x87. Because these would be handled entirely by renaming registers, which is done at setup time only, this would not affect throughput as it does with x87.

Obviously there are many details glossed over, but hopefully the gist of it is now clear.


Author: Hubert Lamontagne Date: 2016-02-18 19:32
Hmm, I think this is definitely a CISC, and not a RISC-CISC compromise. It does have the one good-but-kinda-expensive feature of CISC: Load-ALU operations.

Some criticism of the proposal:

Instruction format:
For the instruction size, I agree with Jonathan Morton: the second most useful instruction size after 32 bits is probably 16 bits, because most code consists mostly of instructions that would fit in 16 bits, giving a theoretical size gain of up to 50%. This is why so much 32-bit ARM code is compiled in THUMB mode: the same code typically runs 0%-15% faster because the smaller instruction stream compensates for the extra operations needed to cope with the small opcodes (nb: this has probably changed on newer cores!). Instruction sizes larger than 32 bits are not that useful because immediates larger than 16 bits are rare, and if you have enough registers, you can load oversized immediates beforehand; the remaining operations with large immediates can be decomposed into 2 or 3 operations (such as add r0, 0x3423; add r0, 0x6543 * 65536). Also, large immediates tend to be multiples of 2/4/8/16/32/etc., which is why ARM's scheme of taking smaller immediates and bit-shifting them works. The other alternative to 32/16-bit mixed instruction size for reducing code cache size is cramming 2 or more operations per 32-bit opcode (which might actually be a good idea!).
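The ARM scheme mentioned here can be illustrated concretely: ARM32 encodes a data-processing immediate as an 8-bit value rotated right by an even amount. The following is a rough representability check, a sketch rather than the exact encoder:

```cpp
#include <cassert>
#include <cstdint>

// Returns true if 'value' can be expressed as an 8-bit constant rotated
// right by an even amount - the ARM32 data-processing immediate scheme.
bool arm_immediate_encodable(uint32_t value) {
    for (int rot = 0; rot < 32; rot += 2) {
        // rotate left by 'rot' to undo a right-rotation by 'rot'
        uint32_t rotated = (value << rot) | (value >> ((32 - rot) & 31));
        if (rotated <= 0xFF) return true;
    }
    return false;
}
```

This is exactly why constants like 0xFF000000 fit in one instruction while 0x65430000 (15 significant bits) needs to be split into multiple operations.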

I'm also not sold on load-multiple operations. The reason is that instructions that write multiple results are generally bad. From the point of view of the register renamer and out-of-order execution engine, that's 2..N renames to keep track of, and 2..N writebacks to the register file. This means that instruction issue has to stall because subsequent instructions have to wait for all these registers to get renamed. You lose the simplicity of having each instruction be single-result only. That being said, ARM64's compromise of allowing a 2-target load-pair but not more sounds acceptable to me (since ARM64 has to deal with other multi-result instructions anyway).

Registers:
As far as I can tell, separate register files are GOOD. The reason is that as you add more read and write ports to a register file, its size grows quadratically (or worse). This makes CPU components larger, which increases propagation time and fanout, and multiplies the complexity of the register renamer. If you give floating point operands their own register file, then aside from load/store, compare and conversion operations, the FPU never has to interact with the rest of the core. So for the same IPC, say 2 integer + 2 float per cycle, separating float operations means you go from a monstrous 8-read 4-write register file and renaming mechanism where both integer ALUs and FP ALUs have to be wired everywhere, to a 2-issue integer unit and a 2-issue FPU. The FPU can have its own register renaming unit, its own scheduler, its own register file, its own writeback unit, its own calculation latencies, and FPU ALUs can be wired directly to the registers, and the whole FPU can live on a different section of the chip. The front end simply recognizes which ops are FPU ops and queues them there. The same applies to SIMD.

The reason why the integer register file of the CPU isn't also split is that integer operations have a lot of interactions with each other and with loads/stores and jumps, and getting C++ compilers to recognize which partition to put every op/result in quickly turns into an NP-complete problem. The exception to this is Ivan Godard's Mill's belt, which in a 4 ALU design forces each ALU to only write to a different 1/4th of the registers. It might be possible to make a good case for the 68000's idea of separating pointers into a different register file - after all, the C++ compiler knows which operands are pointers. Yes, this increases the number of opcodes (bad), but it decreases the amount of operations competing for the same register ports and ALUs for some given workload (good).

For the whole operand size thing: aside from SIMD and loads/stores, there's no reason to have 8- or 16-bit operations. Conversely, 64-bit operations where one of the operands is 32 bits and gets sign-extended or zero-extended to 64 bits are justifiable (ARM64 has them), including in address computations (there's tons of C++ code that does something like array[int index]). Some 32-bit operations can be done in 64 bits while ignoring the top bits (add, sub, mul, and, or, xor, shl) but not all (shr, asr, comparisons).
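The last claim is easy to demonstrate in C++: truncation to 32 bits commutes with addition but not with logical right shift, because high garbage bits shift down into the result:

```cpp
#include <cassert>
#include <cstdint>

// 32-bit add carried out in a 64-bit ALU: the upper 32 bits of the inputs
// can never influence the truncated result.
uint32_t add32_via_64(uint64_t a, uint64_t b) {
    return (uint32_t)(a + b);
}

// 32-bit logical right shift carried out in a 64-bit ALU: WRONG whenever
// the register holds garbage above bit 31, because those bits shift down.
uint32_t shr32_via_64(uint64_t a, unsigned s) {
    return (uint32_t)(a >> s);
}
```

This is why an instruction set that works on full 64-bit registers still needs explicit 32-bit forms (or zero/sign extension) for shr, asr and comparisons.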

Flags:
Add-with-carry and subtract-with-borrow are, I think, unnecessary because they can be faked with 3 simple operations: add X with Y, compare the sum with X and output 1 if it is lower, 0 otherwise (SLTU on MIPS), then add the comparison output to the sum. ADC and SBC operations are problematic because they're really 3-input 2-output operations (bad), which means they'll probably have to be broken down into 2 micro-ops, which means you'll probably see little gain over the 3-instruction sequence.
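The 3-operation sequence described here can be sketched in C++ (extended to also consume an incoming carry, which - as the discussion below makes clear - costs two more operations on top of the basic three):

```cpp
#include <cassert>
#include <cstdint>

// Emulated add-with-carry without a flags register:
//   add, set-less-than-unsigned (SLTU), add the carry.
uint64_t adc_emulated(uint64_t x, uint64_t y, uint64_t carry_in,
                      uint64_t* carry_out) {
    uint64_t sum = x + y;           // add X with Y
    uint64_t c   = (sum < x);       // SLTU: 1 if the add wrapped around
    sum += carry_in;                // fold in the incoming carry
    c   |= (sum < carry_in);        // a carry can also come from this add
    *carry_out = c;                 // at most one of the two tests fires
    return sum;
}
```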

Predicate registers:
I'm definitely not sold on the whole predicate thing. As far as I can tell, compilers really don't like issuing conditionals as anything other than conditional branches. Also, if the conditionals can be accurately predicted, then conditional branches are faster because you only execute one side of the branch, and operations downstream can get their inputs earlier (by register renaming, instead of waiting for the predicated instruction results). For remaining cases, a separate CMOV instruction sounds a lot more justifiable to me than spending 3 bits of every single opcode. Also, remember that predicated operations (3-input) are fundamentally different operations from non-predicated operations (2-input), since the old value can be propagated instead of simply being tossed so it needs to be present as an ALU input.

Rounding mode:
I agree that float-to-integer conversions must support at least truncation for the C++ (int) cast, plus floor() and ceil() and round(). Actually it would probably be useful to have an opcode that does floor() or ceil() without the integer conversion as well (for linear interpolation).

Exception control:
My point of view on exceptions is that they're generally bad, since they can basically turn any ALU operation into a potential conditional jump. This forces you to keep the CPU state at the moment of that operation until you're sure that the ALU operation went the right way. Also, they are useless for running C++ code. (Ok, to be fair, you already need to deal with potential page faults on every single load/store so it's not really that much more work, but this is definitely not the kind of thing I'd want to encourage)

Zero masked data:
The reason I can see for not putting this one in is that non-zero-masked ALU operations are actually very different operations and rather complex, since they prevent register renaming and force you to implement 3-input result-merging versions of basically every ALU operation (similar to the predicated operations above). These result-merging operations will probably see little use aside from manipulation tricks in hand-written assembly.

FPGA:
This is probably an okay idea for platforms like game consoles that essentially run a single program. This reminds me of the DSP on stuff like the N64, which you could rewrite the bytecode for (which only Factor5 ever did if I'm not mistaken). But otherwise, I think this is essentially impossible to task switch: you'd need to build the whole FPGA state to be loadable/storable which would probably make the performance pretty bad.


Ok, my turn with suggestions!

2 x ALU Instructions (sequential!):
Your 3rd operand can either be an immediate, a register value or a memory-loaded value. I suggest adding a 4th option: letting this 3rd operand be the result of a simple 2-operand math operation (i.e. something like add, sub, and, or, xor, bit shifts, maybe mul...), potentially with a small immediate as the 2nd operand. ARM already has something similar (except the only operation you can do is a bit shift), multiply-accumulate is also similar, and load-ALU operations can be seen as a version of this too. This is a very common sequence: a LOT of code is made out of 2-7 successive ALU operations on the same operand. The cost is that this is a 2-cycle-latency operation reading 3 register ports. The benefit is that you're getting 2 instructions for the price of 1: you can squeeze it into a 32-bit opcode (= increased code density without resorting to 16-bit opcodes), it's only 1 instruction for the front-end, and it saves 1 register read over the equivalent 2-instruction RISC sequence (3 reads instead of 4). But most importantly, it saves 1 register WRITE, which lets you reduce the number of register renames for a given block of operations and reduce the number of write ports on your register file.

Software pipelining:
For software pipelining, I actually have an interesting design to propose, which is similar to Jonathan Morton's proposal but doesn't use a stack. I like to call it "SIME" (single instruction, multiple execution). For an 8*32-bit SIME unit, you'd build it as 8 simple MIPS-like cores, each 1 instruction-per-cycle in-order. But only the first core has a front-end: once the first core executes an instruction, it queues the same instruction to the second core (which will in turn queue it to the 3rd core, etc). For values involving feedback (for instance an accumulator), you also have a data queue going from the 1st core to the 2nd, from the 2nd core to the 3rd, and so forth, with a queue from the 8th core back to the 1st to provide looping. ALU operands can come either from registers or from the queue from the previous core, and likewise results can be queued to the next core in addition to being written to the register files. All load/store operations are inherently gather/scatter operations since they execute sequentially on each successive core. For conditionals, cores 2-8 check that the conditional being evaluated produces the same result as on core 1, and if it doesn't, some fallback mechanism is activated (this is generally used to do a number of iterations that isn't a multiple of 8). When an interrupt/task switch/page fault happens, the OS needs special opcodes to load/store values from the instruction queues and data queues to save/restore the state. This system could be extended for larger CPUs by making it superscalar (2-issue in-order or even out-of-order), adding SIMD on top of SIME, or adding more cores (it's probably not too hard to design it so that the number of cores can be changed without changing the instruction stream, aside from OS state loading/saving).
Unfortunately I don't think C++ compilers can automatically produce the kind of loop that would run on this, due to the fact that memory loads/stores are reordered (unless either pointer aliasing detection gets much better, or a Transmeta Crusoe-style load/store aliasing resolution mechanism is provided).


Author: Agner Date: 2016-02-21 02:48
Thank you Hubert for your detailed feedback.

Regarding 16-bit instruction size: I don't think 16-bit is the optimal instruction word size for larger systems, and Moore's law is still making systems larger (it may slow down a little, but it hasn't stopped yet). A 16-bit instruction size means small immediate constants and small address offsets with odd sizes. Most memory addressing should be relative to the instruction pointer or the stack pointer. Using a double-size instruction (2*16 bits), you have 16 bits for address offset. This will give overflow during relocation in the linker or loader if the combined size of code + static data exceeds 32 kbytes (the offset is signed). Relocation overflows happened quite often in the old DOS days, and applications haven't become smaller since then. Most PC applications today are bigger than 32 kbytes, so you need 32 bit address offsets (or ugly memory segmenting). With 16-bit instruction words, you will need at least 5 instruction words for all instructions that access static memory. This means complicated instruction-length decoding. This is one of the reasons for my proposed compromise of a 32-bit instruction word size, and allowing two tiny instructions in a 32-bit instruction word.

Load/store multiple registers instruction: I am imagining that this instruction will be decoded into multiple micro-ops. The only purpose is to save code space.

Add with carry: Your proposal removes the carry output but not the carry input. It will be very complicated if you also remove the carry input: add A+B, generate carry out, add carry in, generate another carry out, add the two carry outs. This is 5 instructions instead of one. Add-with-carry is typically used in high precision math with long chains of add-with-carry. The latency of such a chain will be much longer if you don't have an ADC instruction. Most contemporary instruction sets have two outputs anyway: target register and flags.

Separate register files: My proposal does not have separate registers for integer and floating point - it has separate registers for scalars and vectors. Both can handle integer and floating point. Do you want to split it into four register sets: integer scalar, float scalar, integer vector, float vector? This will require a lot of cross couplings and extra instructions for converting between these. I think that I have more focus on vector instructions (SIMD) than you have. Performance-critical applications are increasingly using vector instructions because this is an efficient way to boost performance. I agree that typical non-vector code has few couplings between integer registers and floating point registers, except for address pointers. But vectorized software has more such couplings, especially for masks, but also manipulation of sign bits etc. Many instructions are the same for integer vectors and floating point vectors, as I have argued before: move, broadcast, blend, permute, gather, scatter.

Predicated instructions: I agree that conditional jumps are faster than predicated instructions if the jump is predicted correctly (but good branch prediction is very expensive in terms of hardware and power consumption). I included predication mainly for the sake of orthogonality between scalar and vector instructions. Masking is indispensable in vector code because you cannot make branches on a per-element basis. Predication is the scalar equivalent of vector masking.

8-bit and 16-bit ALU instructions: These are necessary in vector code, and so they are automatically included in scalar code as well. It would be a waste of power to use 64-bit ALU instructions for everything.

2*ALU instructions: Good idea. We already have multiply-and-add instructions. Double add instructions and shift-and-add instructions would be quite useful as well. Most x86 compilers are actually doing all kind of tricks with the LEA instruction (intended for address calculation) for doing two or three things with one instruction.
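The LEA trick mentioned here is simple to show: one x86 "address calculation" instruction computes base + index*scale + offset, so compilers use it for ordinary arithmetic such as multiplying by 9:

```cpp
#include <cassert>
#include <cstdint>

// What a single x86 LEA computes: base + (index << scale) + offset.
// Compilers use it for arithmetic, e.g. x*9 becomes x + (x << 3).
uint64_t lea_style(uint64_t base, uint64_t index, unsigned scale_log2,
                   uint64_t offset) {
    return base + (index << scale_log2) + offset;
}
```

A shift-and-add instruction in the proposed set would give other architectures the same "two or three things in one instruction" benefit.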

Exceptions: Yes, exceptions are a bad thing. They require a lot of complicated machinery in both hardware and software. We can avoid the need for floating point exceptions by propagating INF and NAN values from the point of error to the final result of a calculation. The IEEE floating point standard includes an error code in the NAN which is propagated through the calculations. (The error codes should be OR'ed when two NAN values are added. Unfortunately, many microprocessors today fail to do this and only propagate one of the two NAN codes.) It would be nice to have a mechanism for detecting errors in integer code as well, but I don't know how to do it. Most systems today generate an exception for integer division overflow, but not for overflow in integer addition and multiplication. This is illogical. We also need an efficient way of detecting whether an array index is out of bounds.
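The NAN-propagation idea can be seen in any IEEE 754 implementation today: once a NAN enters a calculation it flows through to the final result, so the error can be tested once at the end instead of trapping at the point of failure:

```cpp
#include <cassert>
#include <cmath>

// A chain of floating point operations. If the first step fails (sqrt of
// a negative number yields NAN), the NAN propagates through every
// subsequent operation to the final result.
double chain(double x) {
    double a = std::sqrt(x);     // NAN if x < 0
    double b = a * 2.0 + 1.0;    // NAN propagates through arithmetic
    return b / 3.0;              // ...and through division
}
```

(What current hardware does not reliably do, as noted above, is OR the two NAN payloads together when both operands are NAN.)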


Author: Hubert Lamontagne Date: 2016-02-22 13:48
Regarding 16-bit instruction size:
I admit, 16 bit instructions mostly made sense on RISCs that had to contend with having no instruction cache - in other words, ARM and SuperH. On a 32/16bit mixed size architecture, they'd exist mostly for series of arithmetic instructions operating on registers or very small immediates, and having 2-in-1 ALU operations solves this problem and has other benefits. But still, if you're going to go to the length of having a prefetch queue and barrel shifter for 4/8/12 byte variable instruction size, to me it doesn't seem like adding 2 byte increments is that much more work.

Load/store multiple registers instruction:
I guess that one depends on just how many microcoded instructions you have to deal with. To me, I think the #1 priority is making out-of-order C++ run as fast as possible, and setting a "single result register per instruction" limit helps a lot with this goal, because it removes a lot of degenerate cases like "4 two-result instructions in one cycle" (which means 8x register rename at front-end and 8x register writeback at back-end - this is bad).

Add with Carry:
It's true that if you want to do lots of BIGNUM computation, then you'll definitely want a flags register (and 64x64->128 multiplies), whereas if you're doing general-purpose C++ on an out-of-order CPU, you never need flags, and multi-result instructions just aren't worth the extra trouble (which is why MIPS never had them). I have a weird suggestion here: BIGNUM computation will probably always happen on the SIMD unit, so you probably only want flags on the SIMD unit. Or perhaps you could use scalar integer registers as flags on SIMD operations designed for BIGNUMs.

Separate Register Files:
For SIMD code you could definitely have both float and integer vectors in the same register file, and a lot of instruction sets do this. In fact, you could have a register set for SIMD integer+SIMD float+scalar float (this is what ARM does) and it would be usable. I guess I was arguing specifically for not mixing scalar integers and scalar floats.

Predicated instructions / 8-bit and 16-bit ALU instructions:
I think that integer scalar and SIMD operations shouldn't be orthogonal. They don't really need to be, and they're optimized for different things (C++ compiler code and heavy out-of-order execution for integer scalar, maximum throughput at the cost of increased latency for SIMD). So predication and 8/16 bit data should probably be restricted to SIMD units.

Exceptions:
This is the kind of stuff that does pop up occasionally: MIPS and PA-RISC have versions of ADD that trigger overflow interrupts, and x86 has the BOUND instruction (but CPUs typically don't optimize for it - for instance it issues on the vector path on the Athlon and has a 6-cycle latency). The problem is that C++ doesn't use these (+ - * are expected to wrap, and the way C++ conflates arrays and pointers prevents bounds checking most of the time), and higher-level languages typically have to do fancy fallbacks (try {} catch() and so forth), which precludes something as blunt as an interrupt. BOUND is essentially racing against an easy-to-predict conditional branch (which becomes free if the CPU isn't issuing full IPC at any point in the loop).


Author: Agner Date: 2016-02-23 02:12
Hubert Lamontagne wrote:
Add with Carry:
It's true that if you want to do lots of BIGNUM computation, then you'll definitely want a flags registers [...] Or perhaps you could use scalar integer registers as flags on SIMD operations designed for BIGNUMs.
Most CPUs today have a flags register which is updated at every ALU instruction, even if it is rarely used. So most ALU instructions have two result registers in current designs. The flags register also needs renaming. My proposal has fewer two-result instructions than most current designs because you only have a flags output when you explicitly ask for it, and you don't need flags for conditional jumps.

If you want to eliminate two-result instructions completely, then we have a problem with add-with-carry. I am not sure I understand your proposal. Do you want to do a bignum computation using the whole vector register as one huge integer? This would be tempting indeed, but it can't be done in a single clock cycle unless you implement some very heavy carry-look-ahead circuitry. And you will still have a two-result instruction.

Some kind of software pipelining might be the most efficient solution to the add-with-carry problem, but I am not sure we have found a software pipelining model that is sufficiently flexible and not too complicated. Maybe the vector registers can be used as software pipelines so that the data are shifted one vector position at each clock tick.

Regarding 16-bit instruction size:
Instruction length decoding is a serious bottleneck in x86 processors. We should have as few different instruction lengths as possible.

Separate Register Files:
The vector registers cannot have callee-save status because the length is variable. This is a problem if you want to reserve the scalar registers for integer instructions only and do all floating point operations in vector registers. Then you have no floating point register with callee-save status, unless you define callee-save status for part of the register only (x64 Windows does this).

Exceptions:
The x86 BOUND instruction was removed in 64-bit mode, apparently because it was inefficient and rarely used (the code byte has been reused for something else). Checking array bounds can be done quite simply with a compare-and-conditional-jump instruction, which is already included in my proposed instruction set. If you use an unsigned compare then you don't need to check the lower bound (assuming that it is zero).
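The unsigned-compare trick works because casting a signed index to unsigned turns every negative value into a huge positive one, so a single unsigned comparison tests both bounds at once:

```cpp
#include <cassert>
#include <cstdint>

// One unsigned compare checks both 0 <= i and i < n: a negative i,
// reinterpreted as unsigned, is always >= n for any reasonable n.
bool index_in_bounds(int64_t i, uint64_t n) {
    return (uint64_t)i < n;   // compiles to one compare-and-branch
}
```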

Checking for integer overflow is very complicated in high-level languages (see http://stackoverflow.com/questions/1993 ... low-in-c-c). We could make it easier by improving support for checking integer overflow in the instruction set. I have already proposed to use the predicate/mask register for flags also. We could add two more flags bits which are accumulating, i.e. the signed and unsigned overflow condition is ORed with the previous value of the flag bit. The program can then check these accumulating overflow flags after a series of instructions. The compiler may use multiple registers for overflow flags in cases where the extra dependencies would prevent out-of-order execution. These flags registers are then ORed together in the end when you need to check for overflow. This mechanism might also be used for floating point errors as an alternative to the NAN-propagation mechanism. High level language support can easily be implemented with try-catch statements:


try {
  ... a long series of calculations ...
}
catch (signed_integer_overflow e) {
  ... error message ...
}
I don't know if this is sufficiently useful to justify the cost of having more two-result instructions.
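The accumulating-flag idea can be prototyped in software today. The sketch below uses the GCC/Clang intrinsic __builtin_add_overflow (an assumption about the compiler, not part of the proposal) to OR each add's overflow condition into a sticky flag that is checked once at the end:

```cpp
#include <cassert>
#include <cstdint>

// Sums an array, ORing every overflow condition into a sticky flag - the
// software analogue of the proposed accumulating overflow flag bit.
// Returns true if any addition in the series overflowed.
bool sum_with_sticky_overflow(const int64_t* v, int n, int64_t* out) {
    int64_t acc = 0;
    bool overflowed = false;                       // the accumulating flag
    for (int i = 0; i < n; ++i) {
        bool o = __builtin_add_overflow(acc, v[i], &acc);
        overflowed |= o;                           // ORed with previous value
    }
    *out = acc;
    return overflowed;                             // checked once, at the end
}
```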


Author: Hubert Lamontagne Date: 2016-02-23 14:47
Add with carry:
SIMD operations can be run with higher instruction latency than integer instructions (which have to run in 1 cycle or else they tend to bottleneck everything else). For instance, VADD has a 3~4 cycle latency on ARM. BIGNUM processing tends to have other longer latency operations like large 64x64->128 multiplications, so you could live with a 3~4+ cycle vector ADC as well. ADC could be a 4 input instruction: operand_a, operand_b, operand_a_of_previous_computation, output_of_previous_computation (instructions with lots of inputs are relatively common on SIMD instruction sets). This can even be chained, and the CPU's register renaming engine can totally take care of adc-to-adc dependencies.

Among 'modern' architectures (which I'd define as 'architectures that have at least 1 fast out-of-order implementation'):
  • MIPS doesn't have flags at all.
  • DEC Alpha doesn't have flags at all.
  • PA-RISC has a couple of bits in the processor status word, but conditional branches don't use them (a carry flag strictly for adc/sbc/add*/sub*, a multi-step division flag, and a nullify flag that skips over the next instruction).
  • ARM has flags, but only some instructions set them (CMPS, SUBS, arm32 instructions with the 'S' bit) and it doesn't have partial flag updates.
  • x86 infamously has partial flag updates on every ALU instruction (which means it needs multiple aggressive rename units).
  • POWER has an 8-field condition register, with ALU ops optionally updating field 0 and CMP updating a selected field (in addition to a count register).
  • Itanium has 64 single-bit predicate registers (supposedly the one thing that prevented an Intel team from making an out-of-order Itanium!).

So I guess it's a bit of a wash but I don't think flag registers make cpus faster (Alpha didn't need flags to be fast!).

Regarding 16-bit instruction size:
Agreed, multiple instruction sizes are bad unless you have no choice. I'd still argue for a single 4-byte instruction format: lots of fast architectures use it (Alpha, MIPS, PA-RISC, POWER, ARM64), and instructions with large immediates are rare and generally easy to split into multiple instructions. Adding 8-byte and 12-byte instructions doesn't sound like a large increase in complexity, but it is: instructions can span more than one cache line (= you need a prefetch buffer = your pipeline becomes at least 1 or 2 cycles longer); the second instruction of an issue group can sit at several different positions, which means you need more multiplexers (+0, +4, +8 bytes), and this problem compounds for every successive instruction (the 4th instruction can be at +0, +4, +8, +12, +16, +20 or +24); and it adds pipeline stall checks for cases where there are simply too many large instructions and the icache can't keep up.
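The multiplexer growth is easy to quantify: with instruction sizes of 4, 8 and 12 bytes, the set of possible start offsets of the k-th instruction in a group grows by two positions per instruction, which the following sketch enumerates:

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <set>

// Possible byte offsets at which the k-th instruction of an issue group
// can start, when every instruction is 4, 8 or 12 bytes long.
std::set<uint32_t> start_offsets(int k) {
    std::set<uint32_t> offs = {0};          // first instruction starts at 0
    for (int i = 1; i < k; ++i) {
        std::set<uint32_t> next;
        for (uint32_t o : offs)
            for (uint32_t len : {4u, 8u, 12u})
                next.insert(o + len);       // each size adds a possibility
        offs = next;
    }
    return offs;                            // 2*(k-1)+1 distinct positions
}
```

So the 2nd instruction has 3 possible positions and the 4th has 7, matching the counts above; each extra position is another multiplexer input.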

Separate Register Files:
Then it's probably best to have an integer register file, floating point register file, and vector register file yes.

Exceptions:
I still think that's spending an awful lot of silicon in the parts of the CPU that are the most sensitive to timing, for something that I think isn't going to see any use: it isn't even in C++ aside from intrinsics, it prevents the compiler from reordering SSA (it makes + non-associative!), and it can be simulated with a couple of extra MIPS-style ops.


Author: Agner Date: 2016-02-24 00:58
Add with carry:
Your idea of a 4-input add with carry is interesting. I see a few problems, though:
  • You will need more space in the opcode to contain 4 registers.
  • Handling multiple inputs is just as difficult as multiple outputs. For many years, Intel had a limitation of 2 inputs per micro-operation, and they had to split add-with-carry and several other instructions into 2 microoperations for the same reason.
  • The instruction might have a latency of 2 clock cycles unless you can implement it with two double-speed adders. Mixing instructions with different latencies is a problem because a 2-clock instruction may need the result bus at the same time as a subsequent 1-clock instruction. It is best to standardize instruction latencies and have as few different latencies as possible.
  • It is problematic to make a hardware design that cannot handle instructions with two outputs. You also need two outputs for integer division (output quotient and remainder) and for full length integer multiplication.
I can see the following possibilities for implementing 2-output functions:
  • Use one instruction with two output registers.
  • Use two separate instructions, possibly executed simultaneously, one for each output.
  • Use two elements of a vector register.
Method 1 would certainly be the most efficient and straightforward solution. We just have to weigh the hardware costs versus the benefits.
Method 2 is less efficient. For example, for integer division, you cannot make two divisions simultaneously, or even pipelined, unless you double the hardware.
Method 3 might be an efficient solution for a scalar add-with-carry chain, but for other purposes it will complicate the software when you have to split and join data into vector elements. It becomes more complicated when you want to vectorize such code. You may use even-numbered vector elements for addend and sum, and odd-numbered vector elements for carry. This will complicate both hardware and software, and the throughput will be half of what you would get with method 1.

Instruction length:
As I argued before, you need 32 bits for address offset, so you must allow instructions of two 32-bit words.

Many instruction sets don't allow big immediate constants. For example, to load a 32-bit constant you need a memory operand with a 32-bit offset. My argument is that it is more efficient to have a 32-bit immediate operand than a 32-bit offset to a 32-bit memory operand. This will reduce the loading on the data cache. Remember that cache misses are very expensive.

In my analysis I found that you may need instructions of three 32-bit words to accommodate all the bells and whistles of vector instructions with a memory operand, variable vector length, mask, etc. There may be more needs for long instructions in the future as new features are invented. At least the current trend goes towards putting more features into a single instruction to get higher overall performance. This is the reason for my decision to allow instruction lengths of one, two and three 32-bit words. This is certainly a compromise, since instruction length decoding becomes more expensive the more different instruction lengths you have.

If we decide to allow an instruction length of 3*32 bits then we can afford the luxury of allowing immediate constants of 64 bits, for example a double precision float or a 64-bit absolute address.

On the other hand, if we limit the instruction length to 2*32 bits, then there will be certain instructions that cannot have a memory operand with a 32-bit offset, and we will need two instructions to load a 64-bit immediate constant. This would still be a viable solution, but I suspect that a patch would be added in the future when the need for more instruction bits arises as more features are added. Remember how many patches have been added to old instruction sets.


Proposal for an ideal extensible instruction set

Post by agner » Thu Nov 02, 2017 8:40 am

Author: asdf Date: 2016-02-24 04:08
Checking for integer overflow is very complicated in high-level languages
How about you support 2 kinds of arithmetic - saturation and modulo. To check if the arithmetic operation overflowed, do this:

Code:

c = a modulo_op b;
d = a sat_op b;
if (c != d) { overflow }


Author: Agner Date: 2016-02-24 13:04
asdf wrote:
How about you support 2 kinds of arithmetic - saturation and modulo. To check if the arithmetic operation overflowed, do this:
c = a modulo_op b;
d = a sat_op b;
if (c != d) { overflow }
This can easily be implemented. x86-SSE2 has saturated addition.

It is less efficient than my proposal, though, because you need almost 3 times as many instructions.

However, the method doesn't work for multiplication: it can happen that c == d even after an overflowed multiplication.


Author: Agner Date: 2016-02-25 02:13
Agner wrote:
We could add two more flags bits which are accumulating, i.e. the signed and unsigned overflow condition is ORed with the previous value of the flag bit. The program can then check these accumulating overflow flags after a series of instructions.
I wonder if we can get rid of the floating point control word and floating point exceptions with this method.

First, the rounding mode should be specified in the instruction that needs it, not in a global control word. Only a few instructions actually need a specified rounding mode, most importantly float-to-integer conversion. In the rare case where you need a specified rounding mode for addition, multiplication, etc., you will have to use a long version of the instruction with all the option bits.

Floating point errors can be detected in most cases by the INF and NAN propagation mechanism. In the cases where a more detailed error detection is needed, we will specify a combined predication/flags register in the instruction. Bit zero of this register is the predicate. A few more bits are the traditional flags: zero, carry, sign, overflow. And then I want to add a few accumulating error bits which indicate the error condition ORed with the previous value of the same bit. This mechanism also works with vectors, where the flags register is a vector register.

The advantage of this proposal is that we get rid of the floating point control/status register and floating point exceptions. The disadvantage is that we get an extra input and output dependence when a predication/mask/flags register is specified.


limit instruction length to power of 2

Post by agner » Thu Nov 02, 2017 8:44 am

Author: A-11 Date: 2016-02-24 12:49
How about limiting instruction lengths to powers of 2 - 1 byte, 2 bytes (not 3 bytes), 4 bytes (not 5, 6, or 7 bytes), 8 bytes, 16 bytes and so on?

We know that although current x86 instruction lengths are a nightmare - varying from 1 byte to 15 bytes in 1-byte steps - aligning instructions to 16 or 32 byte boundaries helps the fetch unit of x86 processors.
The data types of the C language already obey this rule.
There are 1-byte ("char"), 2-byte (typically "short int", "wchar"), 4-byte (typically "int", "float") and 8-byte (typically "long int", "double") types, but no 3-byte primitive types.
These data are then aligned on boundaries of their own size.
If instructions were aligned on power-of-2 boundaries too, we could use an efficient fetcher (and maybe decoder).

As long as processors use binary notation, powers of 2 are the natural sizes for a processor.
So extending instructions by powers of 2 costs little efficiency.
I think 12-byte instructions would be the same disaster in the future as Intel's 3-byte instructions are today.


Author: Agner Date: 2016-02-24 12:57
A-11 wrote:
How about limiting instruction length to power of 2
If you have, for example, 8B - 2B - 8B, then the second 8B instruction will be misaligned, and the advantage disappears.


Author: Hubert Lamontagne Date: 2016-02-24 17:50
Any techniques for more than 2 loads per cycle?
Going to go on a tangent here but how could gather/scatter be implemented in hardware? The 'traditional' way to implement data cache seems to be to have 2 read ports and 1 write port with banking (and if your 2 loads fall on the same bank you only get 1 load), with aggressive reordering, but obviously this limits scatter/gather issue width a lot (probably making anything more than 4 or 8-way gather/scatter useless). Increasing the number of L1 ports causes tons of problems:
  • It makes bank selection for loads more complex, potentially increasing load latency by a cycle (presumably from 3 cycles to 4 cycles with address calculation included) due to having more multiplexers on address inputs on each bank, more multiplexers on writebacks, more different stalling scenarios and probably requiring an increase in the number of banks.
  • It makes load/store address conflict detection harder since you need to check even more reads against pending writes in the write buffer, and deal with more scenarios like multiple reads trying to access forwarded store values.
    I've played around with various concepts to deal with this but I'm not sure I've found anything really interesting yet:
  • A L0 cache could be introduced. Probably something very small, single-way, duplicated multiple times, probably loading whole cache lines from L1 on every miss and probably only used when there are too many loads per cycle to be satisfied by the L1. Problems: this is still limited to 1 store per cycle, filling values from L1 competes with stores for the single write port, doesn't simplify address conflict detection with the store queue. (if I'm not mistaken, GPUs use something like this?)
  • Pointers could be stored in special registers, and when a pointer register is updated, data from nearby addresses (say, possibly something like adr+0 to adr+63) are automatically pre-read into registers, and there is an automatic check that none of the other pointer registers are pointing to the same data with data modifications. You would possibly also have load/store instructions that bypass these special pointer registers (but with address conflict checking). This is very complex (especially the address checking, which is unfortunately necessary for C++ compilers), and it doesn't help you at all if your data is widely spaced or uses indexed offsets (register+register*n). But on the other hand, data accesses that do fall into this pattern (like loading/storing a whole bunch of contiguous stack addresses or object member variables) become register accesses, they can be renamed, reordered willy-nilly, pretty much every instruction can load/store a value, misaligned addresses don't matter anymore (except when changing a pointer register), and if the address is divisible by 64 you can conceivably load/store a whole cache line in one go.

Author: Agner Date: 2016-02-25 01:40
Hubert Lamontagne wrote:
how could gather/scatter be implemented in hardware?
I don't know any microprocessor that can gather/scatter all vector elements simultaneously. It will use multiple clock cycles and gather one - or at most two - vector elements per clock cycle.

I am using a trick when the data to gather are not too distant from each other in memory: Read contiguous data into the largest available vector register, and then use a permute instruction to get the data into the desired positions in the vector. We should of course have efficient permute instructions that can move data from any vector position to any other vector position. The indexes for the permutation are provided in another vector register. An index out of range should produce a zero, so that larger permutes can be produced by ORing the results of multiple permute instructions.


Author: A-11 Date: 2016-02-25 07:20
If you have, for example, 8B - 2B - 8B, then the second 8B instruction will be misaligned, and the advantage disappears.
The same thing happens with data structures.

Code:

typedef struct {
    double a; /* 8B */
    short  b; /* 2B */
    double c; /* 8B */
} foo_t;
For this structure, some C compilers would arrange "8B(a) : 2B(b) : 8B(c - misaligned!)".
But most compilers generate "8B(a) : 2B(b) : 6B(padding) : 8B(c)" for this data block.
So my reply for the instruction case is "8B - 2B - 2B(NOP) - 4B(NOP) - 8B".
Unlike C structures, where the language prohibits rearranging member order, compilers can reorder instructions to fill in the NOP paddings.
For example, from "8B - 2B - 2B(NOP) - 4B(NOP) - 8B - 4B - 2B" to "8B - 8B - 2B - 2B - 4B".
So we only have to worry about the 6B of NOPs in the case where this block is separated by jumps, where we would pad with NOPs anyway for instruction alignment.

I think what we must worry about is not only the explicit NOPs above, but also the implicit unused bits in an instruction.
MIPS-I has 4 fields of 5-bit width in each instruction.
But because of its 3-operand architecture, it tends to use only 3 of the fields, leaving 1 field unused.
In my proposal, your 12B instructions would have to bloat to 16B, which implies wasting 4B.
According to data compression theory, the more predictable these bits are, the less information they hold.
I guess this is one reason why RISCs have lower code density.
It's a trade-off between code density and fetch speed.
Also, from the same theory, the more information we stuff into the bits, the less predictable - i.e. the more random - they become.
The nightmare of the random x86 format might be proof of its high code compression ability.

My understanding of an "extensible instruction set" is one that lets the instruction set we are designing today survive as the old instruction set of the future.
Like you, I can't imagine the day when 32-bit alignment will be too small.
But Intel couldn't either: they supposed 8-bit alignment was enough, resulting in today's nightmare.
So I think we must not assume that we can guess a sufficient scale of alignment.
Exponential sizes are scale-free, like fractals.


Author: Hubert Lamontagne Date: 2016-02-25 10:23
Agner:
Doing one large load then using parts of it is cool yes, although I kinda expect that it's hard to end up with a net speed gain over simple scalar code when doing that sort of thing.

A-11:
That adds some hard decisions in C++ compilers: it forces the compiler to potentially reorder things (for instance, grouping 2B instructions together), and possibly encourages the compiler to heavily favor small instructions (maybe even breaking down large 8B instructions into two or three 2B instructions). Though it also has some benefits: it gives you multiple instruction sizes without needing a prefetch buffer, and it's easy to decode. And I guess you could design it so that 2B instructions generate a single micro-op, 4B instructions can generate two micro-ops, and 8B instructions can generate 4 micro-ops, which would let you align instruction queue inputs to instruction cache outputs, and multi-output instructions could be forced to use larger encodings to ration register write ports.


More ideas

Post by agner » Thu Nov 02, 2017 8:50 am

Author: Agner Date: 2016-03-04 11:16
I have an idea that would make it very easy to optimize array loops:

Define an addressing mode [register1 - register2]
Specify vector length (in bytes) in a register. If the specified value is higher than the maximum vector length supported by the processor then the maximum length is used.

Now we can loop through an array in this way:

Code:

P = address of array
J = size of array (in bytes)
L = maximum vector length (depends on processor)
X = a vector register
P += J;   // point to end of array
while (J > 0) {
   X = whatever_operation[P-J]{vectorlength J}
   J -= L
}
Here, J has the triple function of loop counter, array index, and vector length. The array size does not have to be a multiple of the vector size: The last iteration of the loop will automatically use a lower vector length if required, and no extra instructions are required to calculate the remaining size. We have completely got rid of the extra code that is typically needed to handle the remaining array elements when the size is not a multiple of the vector length. The code will work optimally on different processors with different maximum vector lengths. There is no need to recompile the code when a new processor with a different vector length appears on the market. Obviously, we can read and write any number of arrays inside the loop, using the same method.

If we don't want to have too many different addressing modes, we can maybe ditch the addressing mode with a scaled index register, assuming that the above method will be used for most loops.

And one more proposal. There is a trend to add more and more feature bits to instruction codes, such as rounding mode, exception control, broadcasting, permutation, shifting, zeroing. This makes instruction codes longer, and it is a waste of code cache size because most of these bits are rarely used. I will propose to put some of these extra feature bits into a register. I have already specified a predicate/mask register in my initial proposal. The extra feature bits will be specified in the same register. Now, we will have only one "enable-features" bit in the instruction code, which enables the extra feature bits in the register. If the "enable-features" bit is zero then all but the predicate/mask bit in the register are ignored.

All unused bits in the features/predicate/mask register are reserved for future use, and must be zero.

Some of the bits in the features/predicate/mask register can be output bits. They can be used for flags (carry, zero, sign, overflow) and accumulating error flags as I proposed in a previous post. The output bits will be unchanged when the "enable-features" bit is zero in order to save a register renaming.

Only features that influence the scheduler and renamer need to be hard-coded into the instruction. Some of the feature bits will not be available for vectors with 8-bit granularity if we don't have enough bits.


Author: Hubert Lamontagne Date: 2016-03-07 10:57
I love your idea of using the remaining number of iterations, clamping it to the SIMD width and using that as a per-iteration width. I have to admit that's how a lot of my block processing code looks:

Code:

for(int i=0; i<nb_samples_to_do;)
{
    int block_samples = nb_samples_to_do - i;
    if(block_samples > 64) { block_samples = 64; }
    [process block_samples items];
    i += block_samples;
}
I think the scaled indexed addressing mode is mainly there for another reason: reading look-up tables, and other cases where the array index is calculated on the fly inside the loop (2D texture mapping, audio resampling and so forth). This addressing mode mostly makes sense for scalar integer and floating-point operations though (and scatter/gather operations if you end up having those).

For SIMD code, you tend to have free leftover cycles on the integer scalar part of the cpu so I don't feel that addressing modes are all that important - on the ARM NEON code that I did, I could simply do pointer updates and recalculations for the addressing types that the NEON didn't allow because performance was limited by the NEON unit anyways. On the other hand, you can also allow fancier addressing modes like post-increments in SIMD code because they don't use the same register files (MIPS has this: integer loads/stores are register+offset ONLY, but floating point loads/stores also allow register+register*4 since it doesn't create the case where store operations need 3 input registers; ARM NEON has a post-increment addressing mode where the increment is the SIMD load width).

Come to think of it, would it make sense to adapt the code for different maximum SIMD vector length in the relocation pass? (ie when correcting all jump offsets when loading a DLL or doing address layout randomization to prevent hacking when loading executables)


Author: Agner Date: 2016-03-08 01:52
Hubert Lamontagne wrote:
would it make sense to adapt the code for different maximum SIMD vector length in the relocation pass? (ie when correcting all jump offsets when loading a DLL or doing address layout randomization to prevent hacking when loading executables)
The Gnu loader actually has this feature, called gnu indirect function. It can make entries in the procedure linkage table (PLT) point to different versions of a function depending on, e.g., which instruction set is supported. This feature is useful, but poorly documented. My idea is that you need only one version of the code and it will work optimally on all processors regardless of their maximum vector size. With current systems, you have to make a new version of the software every time a new processor with an improved instruction set comes on the market.


Author: Agner Date: 2016-03-09 10:47
The idea of supporting vector registers with variable length has important consequences for the instruction set architecture as well as for the entire ecosystem of compilers, function libraries, etc. I will discuss my thoughts about this here.

First, the register set. We have discussed whether there should be different registers for integers and floating point numbers, and for scalars and vectors. So far, the following solutions have been proposed:
  • One universal register set for everything.
  • Two register sets, one for scalars and one for vectors. Same registers are used for integers and floating point.
  • Two register sets, one for integer scalars and one for everything else: floating point scalars, integer vectors and floating point vectors.
  • Three register sets, one for integer scalars, one for floating point scalars and one for vectors of all types.
The reason for using the same vector registers for integers and floating point numbers is that they share many of the same instructions, as mentioned in a previous post.

If we assume that a lot of floating point code involves arrays and loops, then we must prioritize easy vectorization of floating point code. If we assume, furthermore, that a lot of floating point code contains calls to mathematical function libraries, then we must make these library calls vectorizable. Mathematical library functions such as sine or logarithm should have a variable-size vector as input and a similar variable-size vector as output. It will be simpler to use the same functions for scalars by specifying a vector length of one, rather than having separate function versions for scalars and vectors. This will make it easier for an optimizing compiler to convert function calls in scalar code to vector code. A consequence of this is that we should use the same register set for floating point scalars and floating point vectors.

A drawback of using vector registers for scalars is that vector registers cannot have callee-save status because the vector length is variable with no theoretical upper limit. We must find out if scalar non-vectorizable floating point code is sufficiently common to justify having a separate register set for floating point scalars. For integers, on the other hand, there is no doubt that scalar code is common. We need scalar integer registers for pointers, loop control, and all kinds of general code. This leaves us with option 3 above as probably the optimal solution: one register set for integer scalars, and another register set for floating point numbers and vectors.

The priority on vector support, variable-length vectors, and variable length vector functions has important consequences of the whole ecosystem of compilers, function libraries, etc. We must define an ABI standard that supports functions with variable-length vectors. If registers r9 - r15 are used for specifying vector length, as proposed, then it will be natural to use these registers also to specify the vector length of function parameters and function returns. If multiple vector parameters have the same length (in bytes), then they should use the same vector length register, r9. If multiple vector parameters have different length then they will use r9, r10, etc. If there are more than 9 scalar integer parameters before one or more variable-length vector parameters, then the vector length will have precedence over the scalar integer parameters for the use of r9 - r15.

The vector length is specified in bytes in the vector registers and in assembly code because this makes loops more efficient (we can use the same register for loop counter, array index and vector length as explained in my previous post). High level code differs from this by specifying the vector length as the number of vector elements. The compiler can easily translate this to bytes, like it is already doing for array indexes. A function that uses multiple vectors of different kinds should preferably have the same element size for all vectors, i.e. 64-bit integers if you have double precision floats.

This system also needs special support in compilers. As a minimum, we need a way of defining functions with variable-length vectors as parameters and as return value. Many contemporary compilers already have a way of specifying fixed-length vector registers as parameters and variables.

The problem that vector registers cannot have callee-save status can be met by making an addition to the object file format that allows a library function to specify which registers it modifies. A compiler that supports whole-program optimization can use this information at the register allocation stage to avoid the need to save registers across calls to library functions with static linking.


Proposal for an ideal extensible instruction set

Post by agner » Thu Nov 02, 2017 8:56 am

Author: Joe Duarte Date: 2016-03-07 15:18
Very interesting Dr. Fog. I love clean-sheet approaches to ISAs and microarchitectures, since we've been stuck with the same dated monoculture for so long.

Regarding your proposal, what do you think about a transport-triggered architecture? Could your framework benefit from elements of a TTA? Instruction and code size could be reduced if certain registers automatically squared their contents, for example, or if designated pairs of registers automatically summed their contents. There are many possibilities. (Is there already an incrementer (by 1) register in existing architectures? Someone told me there was, but I've not seen it. That would be handy for some loops.)

Relatedly, do you think it would be worthwhile to have certain constants hardcoded into the processor? I'm thinking of things like pi, Euler's number, sqrt(2), and others. The actual constants to include would be best determined empirically, based on which are most used by programs in this era. Might there be useful speed-ups if a given register multiplied its contents by pi, for example? I have no idea what the physical engineering in the silicon would entail -- perhaps TTAs present difficulties?

Cheers,

Joe Duarte


Author: Agner Date: 2016-03-08 02:11
Joe Duarte wrote:
what do you think about a transport-triggered architecture? Could your framework benefit from elements of a TTA? Instruction and code size could be reduced if certain registers automatically squared their contents, for example, or if designated pairs of registers automatically summed their contents. There are many possibilities. (Is there already an incrementer (by 1) register in existing architectures?
A full transport-triggered architecture, as defined by Wikipedia, is very dependent on timing; and the software needs to be recompiled for different CPUs with different latencies in the execution units. I don't see how it can handle out-of-order execution. I think the explicit parallelism will suffer every time there is a cache miss. Your idea is not as radical as this, but it will be very application specific. You cannot use a register that is tied to a specific ALU function in applications that don't need this function. Instructions with read and increment pointer are available in many architectures.

Hard-coded constants like pi etc. are available for the x87-style floating point registers in x86 processors, but these registers are now mainly obsolete and replaced by the vector registers, which don't have this feature. Apparently, the advantage was too small. My proposed instruction set allows immediate floating point constants of both single and double precision. I think this will serve the same need.


Author: Hubert Lamontagne Date: 2016-03-08 12:39
Joe Duarte: Looking up transport-triggered-architectures.... They are interesting but I think they share most problems with VLIW:

Programs typically look kinda like this:
  • Load from memory #1
  • Long chain of dependent math #1
  • Store to memory #1
  • Load from memory #2
  • Long chain of dependent math #2
  • Store to memory #2
To run in parallel, you have to run the math chain #1 and #2 at the same time. Since chain #2 ops depend on load #2, you have to move load #2 up before store #1. This forces the compiler to prove that load #2 and store #1 can't possibly fall on the same memory address (this is what the LLVM alias analysis does), which turns out to be a hard problem often requiring global analysis and often fails. VLIW architectures often have software alias detection to deal with this: the Transmeta Crusoe had Load-Lock, Store-Check, Commit(+jump to fallback if commit fails); Itanium infamously had the ALAT where you'd do a ld.a (advanced load), then later on a ld.c or chk.a to confirm that the data from the ld.a isn't baloney and branch to fallback code if it is.

Out-of-order architectures are popular because they do this automatically for you (the store operations calculate the target address on the spot but can wait for the value for many cycles). Also, what if a load doesn't fall into L1 cache? On a VLIW (and, presumably, TTA, unless you made all your transport use queues), this is a hard stall. Out-of-order architectures can at least somewhat reorder operations around this - with some luck, hopefully it can find enough operations to do until it can get an L2 cache result.

Other problem is, as Agner said, that it's hard to adapt code written for a gen 1 TTA cpu to some presumably larger gen 2 TTA cpu - you'd probably need to more or less dynamically recompile it to the new wider cpu, which is easily as complex as current out-of-order RISCs and ARMs and x86s.

That being said, supposedly NVidia's Denver pulls off VLIW correctly and gives good perf, so I guess it is possible to make this work.

---

For built-in constants, that's not so useful because a lot of constants also have some scaling built-in (for instance result = sin(2.f * 3.141592653f / 256.f * i); ) and most of the time if your calculation involves pi or e, it involves some very slow op like sin() or exp() so having to load one more constant from cache won't slow down things appreciably.


Author: Joe Duarte Date: 2016-03-09 19:58
Thanks Agner and Hubert. That's good to know about constants – my intuition there was wrong I guess.

Agner, why 32 registers? That's an old norm, and thus seems somewhat arbitrary. Is there good empirical research on optimal register count for general purpose modern and foreseeable computing? All I've seen is stuff like this, focused on embedded systems: arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf By their measures, 80 registers looks good, but it doesn't make a big difference. I don't know if their method is valid, though, since it's not my field. 32 registers seems a bit low in your case since they serve as unified integer/float registers.

Papers I've read recently that you might find interesting: I like that you're proposing real progress in ISAs. I've been so disappointed in the laziness and lack of innovation in the industry with respect to instruction set architectures, operating systems, and systems programming languages. We've been in an x86, POSIX/Windows, and C rut for a very long time.

John Regehr had some interesting thoughts about what instructions we'd *discover* if we started from first principles and generated optimal instruction sets based on some starting assumptions about what humans need computers to do: blog.regehr.org/archives/669

Regarding your variable vector size, I wonder also about variable data bit-length. The classic doubling values of 8, 16, 32, 64, 128-bit, etc. seem arbitrary to me, and I wonder if a careful empirical investigation might tell us to use different sizes. Or perhaps variable bit lengths could be implemented as easily as variable vector sizes (which usually specify only a couple of allowed field lengths). I don't know what the processor hardware engineering implications of this are. We might discover an energy and performance sweet spot of, say, 24-bit integers or 40-bit floats for lots of applications, for example.


Author: Agner Date: 2016-03-10 04:51
Joe Duarte wrote:
why 32 registers?
Thanks for the references. The article by Alipour seems to be about the physical register file used for renaming, not the number of logical registers.

If the compiler can do global register allocation then it can avoid spilling registers to memory when function1 calls function2 by using different registers in the two functions. The more registers we have, the more levels of nested function calls we can have without spilling registers to memory. But we should not forget that a typical program uses more than 99% of its execution time in the innermost loop. The innermost loop should not have more than 2 - 3 levels of function nesting in a well-designed program. Register spilling outside the innermost loop is pretty irrelevant for performance. If we assume that each function uses a handful of registers and that some of these are used for short-lived variables that do not need to be saved across a function call, then the optimal number of registers might be something like 16.

If a typical instruction code has three register fields then we will need to use three more bits of instruction code for each time the number of registers is doubled. With 32 registers and three register fields, we will use 15 bits of the 32-bit code word only for specifying register operands.

The proposed register set includes vector registers of variable size, and the size may grow indefinitely in future implementations. Saving and restoring a vector register with variable size is quite complicated. First, you have to detect the maximum vector size, and then you have to allocate a corresponding space on the stack. It is good to have many vector registers in order to minimize the need for this procedure. Therefore, I think that 32 vector registers is reasonable. It will be difficult to extend the number of registers in the future if the need should arise, so it is better to settle for too many than too few.

The number of scalar registers may be the same because scalar and vector instructions should be coded in the same way, according to my proposal.


Author: Hubert Lamontagne Date: 2016-03-11 01:58
Joe Duarte:

32 registers kinda balances the need for smaller instructions and the fact that smaller register files are faster, smaller and soak up less power, with the fact that memory accesses are slow and complex. With register renaming, physical regfiles generally have at least 64 registers (MIPS R10k) if not way more (88*3 on Athlon, many more on hyper-threaded CPUs), so anything less than that only saves instruction bits. You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium) but the general consensus seems to be that this is overdesigned and MIPS does just as well with 32 ordinary registers. 16 registers is almost as good in typical code (see: ARM, x64) but the "cost" of increasing from 16 to 32 is low enough that architectures tend to go with 32.

Extremely wide in-order CPUs (ie VLIWs) might need more registers to keep all the values generated by software pipelining (Itanium illustrates this) but for "mainstream" designs this isn't considered to be a good plan (ie if you want to make a very wide core, you'll probably have to make it out-of-order to make it any faster than 2-instructions-per-cycle anyways).

Also note that it's very common to have different numbers of float and SIMD registers. For instance, ARM has 16 registers, but its FPU has 32 registers (for the Arm A8/A9/A15/etc fpu, shared with SIMD).

2^N variable sizes exist because you want to be able to calculate array memory addresses with a bitshift. If you allow 24bit integers for instance, then your memory calculation becomes [pointer + (index<<1) + index], not so convenient. And DRAM tends to come in multiples of 8 bits or 9 bits (for parity). Some DSP architectures use 24bit, 48bit and other unusual integer sizes.
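Hubert's point about address arithmetic can be illustrated with a minimal sketch. The function names and values below are made up for illustration; the point is only that a power-of-2 element size needs one shift, while a 3-byte (24-bit) element needs a shift plus an add, exactly the [pointer + (index<<1) + index] form quoted above.

```python
# Byte address of array[index] for different element sizes.

def addr_pow2(base, index, log2_size):
    # Power-of-2 element size: a single shift.
    # e.g. 32-bit elements: log2_size = 2
    return base + (index << log2_size)

def addr_24bit(base, index):
    # 3 bytes per element: index * 3 = (index << 1) + index,
    # so one shift and one extra add on the address path.
    return base + (index << 1) + index

assert addr_pow2(0x1000, 5, 2) == 0x1000 + 20   # 32-bit elements
assert addr_24bit(0x1000, 5) == 0x1000 + 15     # 24-bit elements
```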

The idea of idempotent instruction groups is interesting, and somewhat complementary to another different instruction grouping conceptual scheme I'm playing with (grouping chains of dependent instructions so that only the last instruction of the group writes to a register).

---

"We've been in an x86, POSIX/Windows, and C rut for a very long time."

This is for a good reason. The ~4 instruction per cycle out-of-order CPU is pretty hard to beat in terms of practicality and speed, and attempts to beat it face some pretty daunting challenges. Itanium was a valiant effort, but it failed and it just was never really faster than x86.

One big problem is that the L1 data cache will, at best, have 2 read ports and 1 write port, and that typical code often has 30% of memory loads/stores. This means that it's hard to get a speed gain when making a cpu that runs more than about 4 instructions per cycle.

The last DEC Alpha design was going to do 8 instructions per cycle, but it just couldn't do it for typical programs and they had to run multiple threads on the core to be able to keep the pipeline full. Part of the reason why Intel is top-of-the-game now is that they're top-of-the-memory-access-game.

In C++, the program basically specifies the exact order of memory loads/stores, and it takes huge efforts to escape this ordering (compiler alias analysis, out-of-order cpus, weird speculative loads/stores in VLIWs). Multi-threading, SIMD and even GPUs can be viewed as basically mechanisms to make this ordering more flexible.

Higher level languages like Python typically do even more loads/stores/jumps than C++, which makes them even less optimizable (since it's very likely that they are essentially serial, and they let you do crazy tricks that force you to do everything serially). If there's any hope to get a language that's more efficient than C++, I'd say that IMHO it's probably a language that forces a limitation of "absolutely no pointer aliasing" - so probably with no pointers, no references, no side-effects (and probably copy-on-write objects).


Author: Agner Date: 2016-03-11 03:56
Hubert Lamontagne wrote:
You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium) but the general consensus seems to be that this is overdesigned and MIPS does just as well with 32 ordinary registers.
A problem with rotating register windows is that the register wheel or stack will overflow when functions are too deeply nested (assuming that you rotate one frame at each function call). You have to keep track of the function nesting level, which may be impossible when you call a DLL that calls another DLL, etc.
If there's any hope to get a language that's more efficient than C++, I'd say that IMHO it's probably a language that forces a limitation of "absolutely no pointer aliasing" - so probably with no pointers, no references, no side-effects (and probably copy-on-write objects).
Is that possible? If you have to copy every array to avoid pointer aliasing, then you lose a lot of efficiency.


Author: anon2718 Date: 2016-03-13 23:13
One thing I don't tend to see is hardware management of rotating register windows. Or, to look at it slightly differently: instead of a straight register-based machine or straight stack-based machine, you have a machine where you can freely access the top <k> elements of the stack for some k.

You have a stack in memory. The contents of said stack area in memory are normally undefined (!), except instructions can read / write the top, say, 32 elements, or whatever the window size is. Call / return adjust the stack - although I am not sure how much the stack should be adjusted. There are advantages and disadvantages to fixed and variable-sized adjustments, as well as the question of whether returns must be matched with calls. There may also be instructions to force an explicit and up-to-date read / write of any element of the stack.

To the processor, treat it as a ring-buffer (or potentially tagged-memory, to allow for easier context switching) cache of the top part of the stack. It speculatively saves anything "below" the current window to the stack to make room for calls, and speculatively loads from the stack to keep the buffer full on returns. As usual here, there is the question of how speculative it should be.

Some architectures have a separate instruction cache - this one has a separate stack cache.


Author: Agner Date: 2016-03-14 02:19
anon2718 wrote:
instructions can read / write the top, say, 32 elements. Or whatever the window size is. Call / return adjust the stack
If it is a rotating register window then it will get filled up when function nesting is too deep. If it is an unlimited stack of register windows then it has to be spilled to memory, which would make calls and returns very slow and would require a separate stack for this purpose only.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

A design without a TLB

Post by agner » Thu Nov 02, 2017 9:03 am

Author: Agner Date: 2016-03-11 05:50
I wonder if it is possible to design a microprocessor without a Translation Look-aside Buffer (TLB). A TLB is a cache that is used for virtual address translation. The TLB is quite big and complicated in many modern processors. Some processors even have two TLB levels. It costs a lot of silicon space and a performance loss because of TLB misses.

One of the most compelling reasons for having virtual address translation in current systems is, as I understand it, that you can run multiple instances of the same program. The multiple instances share the same code segment in order to save memory, but they cannot share writeable data segments. The code segment contains references to the data segment. If multiple running instances share the same code segment then they will also share the same data addresses. The only way to keep the data of different running instances separate is to have the same virtual addresses but different physical addresses for each process.

We can avoid most of the costs of the TLB and virtual address translation by having a special register that points to the data segment of the process. The program code should access its data segment through this pointer. Multiple instances of the same program will have different values in their data segment pointer. This allows them to share the same code segment while having different data segments.
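The effect of the proposed data segment pointer can be sketched in a few lines. This is a hypothetical analogy, not ForwardCom syntax: the shared function stands in for the shared code segment, and the list passed in stands in for the per-instance read/write data segment reached through the pointer register.

```python
# Two instances of the same "program" share one piece of code, but each
# carries its own data-segment base, standing in for the proposed
# data segment pointer register.

CODE = {"counter_offset": 0}          # shared, read-only "code segment"

def increment_counter(data_segment):
    # Shared code: no absolute data address anywhere; every access
    # is relative to the data-segment base it was given.
    off = CODE["counter_offset"]
    data_segment[off] += 1

instance_a = [0]                      # private read/write data segments
instance_b = [0]
increment_counter(instance_a)
increment_counter(instance_a)
increment_counter(instance_b)
assert (instance_a[0], instance_b[0]) == (2, 1)
```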

Another use of the TLB is to manage the situation where we are out of physical memory or the memory has become too fragmented. The virtual address translation allows the memory segments to be moved or swapped to a disk. This problem can hopefully be reduced. RAM is cheap today. A well-designed application typically uses 1 MB of memory, while a state-of-the-art PC today has 16 GB of RAM or more. Nobody runs 16,000 different applications, not even on a server. Unfortunately, some applications are wildly bloated. This should of course be discouraged. The need for byte-code interpreters, just-in-time compilation and other RAM-hungry frameworks will hopefully be reduced when the instruction set is standardized so that the compiled code will be compatible with many platforms.

We may still need some virtual address translation if we run out of RAM or for the sake of making virtual machines, but it will be a coarse-grained translation with one or a few large contiguous blocks of memory for each process, rather than the fine-grained translation of current systems with a large number of fixed-size memory blocks.

To see how this can be implemented, we first need to get an overview of the different kinds of data used in a running program and how they can be addressed. Traditionally, we have the following kinds of data:
  • Program code (TEXT). This is executable and read-only. Can be shared between multiple processes.
  • Read-only program data (CONST). This contains constants and tables used by the program. It may be shared by multiple processes unless it contains pointers that need to be relocated at load time.
  • Static read/write program data (DATA+BSS). This is used for global data and for static data inside functions. It needs multiple instances if multiple processes are running the same code.
  • Stack data (STACK). This is used for non-static data inside functions. This is the most common and most efficient way of storing program data. Each process or thread has its own stack, addressed relative to the stack pointer.
  • Local heap. Used for dynamic memory allocation by an application program
  • Global heap. Used for dynamic memory allocation by the operating system and device drivers.
  • Thread data. Allocated when a thread is created and used for thread-local static data. Rarely used.
Now, I will discuss how each of these types of data are addressed in current systems and how they can be managed in a system without a TLB.
  • Program code. In current systems, program code may contain absolute addresses. These addresses are modified (relocated) by the loader if the code is loaded at a different address than expected by the linker. The relocation is often avoided by using virtual address translation. Multiple programs that do not need to call each other can be loaded at the same virtual address.
    My proposal is to avoid the need for relocation by using relative addresses as much as possible. All addresses within the same code segment are addressed relative to the instruction pointer.
  • Read-only program data. Current systems use either relative or absolute addresses to address read-only data. These addresses often need to be relocated by the loader.
    My proposal is to make the read-only data segment contiguous with the program code segment, and access it with addressing relative to the instruction pointer. This needs relocation at the link stage, but not at the load stage.
    This segment may contain pointers. This is typically needed in switch/case jump tables, virtual function tables, function pointers and data pointers. Current systems often use absolute addresses in these cases, needing relocation by the loader. Some compilers use self-relative pointer tables or pointers relative to an arbitrary reference point (used in 64-bit Windows and Mac OS).
    My proposal for jump tables, virtual function tables and code pointers is to use 32-bit self-relative addresses or addresses relative to the code base.
    Tables of constant pointers to data are a problem because - in order to use relative pointers - we need to know whether the pointer target is in the read-only data or the read/write data segment. Preferably, such a table should be placed in the same segment as its targets, either the read-only data segment or the read/write data segment. This makes it possible to use self-relative pointers. If it contains a mixture of both, then it should be placed in the read/write data segment, and any targets in the read-only data segment should be moved to the read/write data segment.
  • Static read/write program data. Current systems use absolute or relative addresses to access read/write data in the same way as read-only data. If multiple processes are running the same program then there will be one instance of the read/write data segment for each process. The multiple instances will typically share the same virtual address, while having different physical addresses. This requires virtual address translation. We want to get rid of this translation.
    My proposal is to have a dedicated register for pointing to the read/write data segment. All data in the read/write data segment are addressed relative to the value in this register. We may implement a special addressing mode for this or, alternatively just let the application copy the data segment register to a general purpose register which is used as pointer. Read-only data may optionally be stored here rather than in a separate segment.
    The data segment pointer register needs to be saved and restored when one program calls another program. It does not need to be saved when calling a DLL, which I will explain below.
  • Stack. Each process and each thread has its own stack which is addressed by the stack pointer. No problem here.
  • Heap. You get a pointer when allocating data on a heap. Heap data are addressed through this pointer. No problem here.
  • Thread-local data. Current systems may have a "thread environment block" which contains various information about the thread and a pointer to the thread-local data segment. In x86, it is addressed through a special segment register. It also contains information about stack size, exception handler, process environment, etc.
My proposal is to preserve this system. We may need a dedicated register to point to the thread environment block or to a thread-local data segment.

Dynamic link libraries (DLLs). My proposal is to use Windows-style dynamic linking rather than Linux-style shared objects, because the latter have the rarely used feature of symbol interposition which makes everything less efficient (see www.macieira.org/blog/2012/01/sorry-sta ... -on-linux/ )

I propose that a DLL cannot have a per-process read/write data segment (this might not be thread-safe anyway). If a library needs writeable data, for example for some initializations, then there are three possible solutions:
(1) use static linking, (2) use data supplied by the caller through a pointer, or (3) use global heap data allocated at load time with an absolute address relocated by the loader. This data block will be shared by all processes.

The same applies to device drivers. A device driver may need a writeable data segment for a mutex and for storing information about the device. This data area is shared between all processes. My proposal is that this data block is allocated at load time. It is accessed through an absolute 64-bit address which is relocated by the loader. In case the driver later needs more data, for example if there are many network printers using the same driver, then the device driver can allocate additional space on the global heap and store a pointer to this allocated memory in the data segment it got by the loader. If the driver needs per-process data then it will use data space provided by the caller through a pointer.

These methods will make it possible to replace a TLB with its many small memory segments of fixed size by a memory map with a few large memory segments of variable size. Each process has its own little memory map which is cached in the CPU. The memory map should indicate the type of each memory segment, but not necessarily any address translation. Memory segments of the same type should be joined together and made contiguous as far as possible in order to make the memory mapping as simple as possible. The memory map should not distinguish between executable code and read-only data for DLLs. This will make it possible to join all DLLs together into one big memory segment. Any unused space between the DLLs can be filled with an error code. The same goes for device drivers. Each process will have its own memory map listing the memory areas it is allowed to access. This will normally have only one segment of each type: TEXT, CONST, DATA+BSS, STACK, HEAP+THREADDATA. A segment for DLLs and their constant data can be shared between multiple processes. Each process will be able to see DLLs that belong to other processes, but these are read-only and contain no writeable data, so I assume that there is no serious security risk here. A process may use static linking instead if it wants to hide which libraries it is using. The reason why I don't want to hide DLLs from processes that are not using them is that this would split the memory into many small pieces, one for each DLL. This would require many entries in the memory map.
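A per-process memory map of this kind can be sketched as a short table of variable-size segments with a linear access check. The segment names follow the list above; the addresses and permission strings are made-up placeholders.

```python
# A per-process memory map with a handful of variable-size segments
# instead of a TLB full of fixed-size pages. With only a few entries,
# an access check is a linear scan.

from collections import namedtuple

Segment = namedtuple("Segment", "start end kind")

process_map = [
    Segment(0x0000_0000, 0x0040_0000, "TEXT+CONST"),   # execute / read
    Segment(0x1000_0000, 0x1001_0000, "DATA+BSS"),     # read / write
    Segment(0x7FFF_0000, 0x8000_0000, "STACK"),        # read / write
]

def check_access(addr):
    """Return the segment kind covering addr, or None (access fault)."""
    for seg in process_map:
        if seg.start <= addr < seg.end:
            return seg.kind
    return None

assert check_access(0x1000_0004) == "DATA+BSS"
assert check_access(0x5000_0000) is None               # would fault
```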

It is my goal to keep the memory map small by keeping similar memory blocks together rather than splitting the memory space into many small pieces of different types.

This may cause problems for programming languages with just-in-time compilation. We will discourage systems and script languages that compile a little piece of code at a time. In fact, the justification for just-in-time compiling disappears when the instruction set is standardized. The code can adapt to different processors with different vector sizes at run time. We only need to (re-)compile code in case a new processor version has new advantageous instructions. The compiler/interpreter should preferably compile the code or script all at once. Piecemeal compilation also causes unpredictable response times which is annoying to the user.

I am undecided about how to implement system calls. It could use absolute addresses or a table of pointers in the read-only data segment or in the thread environment block, or a special system call instruction.

Self-modifying code is discouraged. If an application needs to generate executable code then it should preferably make a DLL and load it before executing it.

Many script languages allow self-modifying scripts. Such scripts should preferably be interpreted rather than compiled. If it turns out that there is a serious need for supporting self-modifying code for applications such as compiling self-modifying scripts, compiling user-supplied macros, or debugging applications, then we may decide to support a memory type that allows both write access and execute access. This write/execute memory will be allocated on a special heap dedicated to this purpose only. Access to use this feature must be restricted in order to avoid abuse by hackers.

Memory model:
The system proposed here gives immediate access to up to 8 GB of code (Jumps and calls use a 32-bit signed offset multiplied by the code word size), 2 GB of read-only data, 2 GB of static read/write data, 2 GB of thread-local data, almost unlimited stack size with 2 GB for each stack frame, and almost unlimited heap space. With such a huge address space, we need not support more than one standard memory model. Everything is accessed through pointers with a 32-bit relative offset.
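The reach figures above follow directly from the field widths: a 32-bit signed offset counted in 4-byte code words gives 2^31 * 4 bytes of forward reach for code, while a 32-bit signed byte offset gives 2^31 bytes for data. A quick arithmetic check (nothing here is new; it only restates the numbers in the paragraph):

```python
# Reach of the relative addressing modes in the proposed memory model.

GiB = 1 << 30

# Jumps/calls: 32-bit signed offset, scaled by the 4-byte code word size.
code_reach = (1 << 31) * 4

# Data access: 32-bit signed byte offset relative to a base pointer.
data_reach = 1 << 31

assert code_reach == 8 * GiB      # "up to 8 GB of code"
assert data_reach == 2 * GiB      # 2 GB read-only / read-write data
```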

In the rare case that there is more than 2GB distance between read-only data and the code that reads it, we will use a pointer to access it. This pointer can be stored in the thread environment block.

There is no addressing mode for absolute addresses. In the few cases where we need an absolute address (e.g. data for a device driver), we will load the 64-bit address into a register and use this as pointer. The 64-bit address is inserted into the code by the loader.

Problems:
There may be security problems if we are using a global heap. One process may be able to read and modify data belonging to another process. We should probably avoid using a global heap.

What can we do if the local heap or stack of an application overflows? If the heap overflows, we may make an extra heap that is bigger. This requires an extra entry in the memory map. If the stack overflows, then we need to move it to a different physical address and use virtual address translation. This still requires only one entry in the memory map, but with virtual address translation. The cost is that we have to copy the entire contents to a new physical address. The alternative to copying the entire stack contents is fragmented memory at the cost of having more entries in the memory map.

For these reasons, we cannot completely get rid of virtual address translation, but we can still keep the memory map much smaller than the TLBs of current systems.


Author: Hubert Lamontagne Date: 2016-03-11 11:06
Afaik, one of the main functions of the TLB is to assist in heap allocation. For allocations that go through the page allocator (>15kb on OSX for instance), it will simply get enough 4k RAM pages to hold the data (all it needs is a large enough contiguous address range in the program's address space to map them to). Remapping memory pages is needed to keep that system working.

Paging also removes the need for segmentation, which is why it's so popular - as far as I know it's still a net gain in simplicity.

I guess the potential avenues for simplification are:

- By making the pages very large, you could perhaps make the page table small enough to fit in on-chip static RAM, which would make the state machine for loading the TLB simpler. Probably not worth the trouble but still an interesting concept.

- You could debate removing page fault exceptions. This would keep the "heap memory management" aspect of paging (and probably the "security" aspect by mapping unauthorized accesses to a dummy page), but would make it impossible to implement virtual memory (aka disk swapping) and other similar tricks (file memory mapping etc). The benefit is that instructions following a load/store are no longer speculative, which could probably be beneficial on some semi-out-of-order architectures (for instance, having load addresses generated on the FPU with an FPU running super late in the pipeline).


Author: Agner Date: 2016-03-11 12:32
Heap and stack overflow is indeed a problem as I have written above.

If we make an effort to minimize fragmentation, then we can still use a memory map with a few variable-size memory blocks instead of a TLB with a high number of small fixed-size blocks. Modern TLBs are very complicated with multi-level lookup. I am sure there are possibilities for simplification.


Author: Agner Date: 2016-03-12 00:45
One more suggestion for reducing memory fragmentation. The operating system could keep statistics of how much stack and heap space each application uses, and allocate as much space as the statistics predict, plus a little more, when an application is started. The first time the program runs, it will use the stack size and heap size specified in the executable file header.

I don't think this is a serious burden to put on the operating system, compared to the complicated work of maintaining the large and complicated multi-level tables required by contemporary systems.


Author: Bigos Date: 2016-03-13 07:35
Hi.

There is another way to reduce the TLB cost, which is used by the Mill architecture [1].

The TLB can be moved from the critical path of L1 cache reads to DRAM reads. Since DRAM reads are already slow, the TLB doesn't have to be fast, which simplifies its design. However, it means that all data on-chip are virtually addressed. Similarly to your proposal, all processes live in a single virtual address space, but the virtual/physical translation is retained.

The security problem is solved by using a PLB (Protection Lookaside Buffer) which is placed where TLB currently is. Since protection data is only needed to occasionally trigger an exception, it's not on a critical path of L1 read. Mill also employs so called well known regions, which are similar to per-thread/per-process segments and reduce the need to use the PLB in most cases.

Since many operating systems implement memory mapping commands like Linux's mmap, removing the virtual-to-physical translation would make it very difficult to port such OSs and their applications.

[1] http://millcomputing.com/docs/memory/ (circa 60th minute)


Author: Agner Date: 2016-03-28 05:13
Ideas for preventing stack overflow:

In most cases, it is possible to calculate exactly how much stack space an application needs. The compiler knows how much stack space it has allocated in each function. We only have to make the compiler save this information. This can be accomplished in the following way. If a function A calls a function B then we want the compiler to save information about the difference between the value of the stack pointer when A is called and the stack pointer when B is called. These values can then be summed up for the whole chain of nested function calls. If function A can call both function B and function C then each branch of the call tree is analyzed and the value for the branch that uses most stack space is used. If function A is compiled separately into its own object file, then the information must be stored in the object file.

The amount of stack space that a function uses will depend on the maximum vector length if full vectors are saved on the stack. All values for required stack space are linear functions of the vector length: Stack_frame_size = Constant + Factor * Max_vector_length. Thus, there are two values to save for each function and branch: Constant and Factor. We need separate calculations for each thread and possibly also information about the number of threads.

The linker will add up all this information and store it in the header of the executable file. The maximum vector length is known when the program is loaded, so the loader can finish the calculations and allocate a stack of the calculated size before the program is loaded. This will prevent stack overflow and fragmentation of the stack memory. We may also store information about how many threads the program will create. Some programs will use as many threads as there are CPU cores, for optimal performance. It is not essential, though, to know how many threads will be created, because each stack can be placed anywhere in memory, but it will make the memory map simpler if all thread stacks can be kept together.
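The link-time stack calculation described above can be sketched as a walk over the call graph, where each function records its frame size as a (constant, factor) pair and the costliest branch is kept at each node. The call graph, frame sizes, and vector length below are invented for illustration; comparing branches at a reference vector length is a simplification, since which branch is worst may in principle depend on the vector length.

```python
# Worst-case stack use over a call tree, following the scheme above:
# Stack_frame_size = Constant + Factor * Max_vector_length.

CALL_GRAPH = {                 # hypothetical program (acyclic: no recursion)
    "main": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": [],
}
FRAME = {                      # per-function (constant_bytes, vector_factor)
    "main": (64, 0),
    "a": (32, 2),              # e.g. saves 2 full vectors on the stack
    "b": (256, 0),
    "c": (16, 1),
}
REF_VLEN = 64                  # reference length used to compare branches

def worst_case(func):
    """(constant, factor) of the costliest call chain starting at func."""
    c0, f0 = FRAME[func]
    best = (0, 0)
    for callee in CALL_GRAPH[func]:
        cc, cf = worst_case(callee)
        if cc + REF_VLEN * cf > best[0] + REF_VLEN * best[1]:
            best = (cc, cf)
    return (c0 + best[0], f0 + best[1])

const, factor = worst_case("main")
max_vector_len = 64            # known when the program is loaded
stack_needed = const + factor * max_vector_len
assert stack_needed == 320     # the main -> b branch dominates here
```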

In theory, it is possible to avoid the need for virtual address translation if the following four conditions are met:
  • The required stack size can be predicted and sufficient stack space is allocated when a program is loaded and when additional threads are created.
  • Static variables are addressed relative to the data section pointer. Multiple running instances of the same program have different values in the data section pointer.
  • The heap manager can handle fragmented physical memory in case of heap overflow.
  • There is sufficient memory so that no application needs to be swapped to a hard disk.
Before we rely on this mechanism, we should discuss what can possibly go wrong. Things that can cause problems are:
  • Recursive functions can use unlimited stack space. We may require that the programmer specifies a maximum recursion level in a pragma.
  • Allocation of variable-size arrays on the stack using the alloca function in C. We may require that the programmer specifies a maximum size.
  • Run-time dynamic linking. Dynamic link libraries (DLLs) are usually linked at load time and the loader will be able to include these in the calculation of stack requirements. But a program can need to load and call a DLL at run-time if the choice of DLL depends on user input or if the DLL is called from a script. We may need to guess the required stack size, perhaps based on statistics.
  • Lazy loading. A large program may have certain code units that are rarely used and loaded only when needed. Lazy loading can be useful to save memory, but it may require virtual memory translation and it may cause memory fragmentation. A straightforward solution is to implement such code units as separate executable programs, but this can complicate the exchange of data between the mother program and its subunits.
  • Script interpreters. Some programming languages are implemented as scripts which are interpreted at run-time rather than compiled. We cannot calculate the required stack size in advance for interpreted scripts. Obviously, it will be more efficient to compile the script if a compiler is available. Self-modifying scripts cannot be compiled.
  • User-defined macros. Macros are similar to small scripts. Depending on the implementation, macros may use heap space or stack space or both, but usually the memory requirement is limited.
  • Many programs running. The memory can become fragmented when many programs of different sizes are loaded and unloaded randomly.
A possible alternative to calculating the stack space is to measure the actual stack use the first time a program is run, and then rely on statistics to predict the stack use in subsequent runs. The same method can be used for heap space. This method is simpler, but less reliable. The calculation of stack requirements based on the compiler is sure to cover all branches of a program, while a statistical method will only include branches that have actually been used.

We may implement a hardware register that measures the stack use. This stack_measurement register is updated every time the stack grows. We can reset this stack_measurement register when a program starts and read it when the program finishes. We don't need a hardware register to measure heap size. This information can be retrieved from the heap manager.

These proposals can eliminate or reduce memory fragmentation in many cases so that we only need a relatively small memory map which can be stored in the CPU chip (Each process will have its own memory map). However, we cannot completely eliminate memory fragmentation and the need for virtual memory translation because of the complications discussed above.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am
Contact:

Proposal now published

Post by agner » Thu Nov 02, 2017 9:15 am

Author: Agner Date: 2016-03-22 10:51
Thank you everybody for all your inspiring comments to my "Proposal for an ideal extensible instruction set". I have now worked everything together and made a more detailed proposal. It is published at http://www.agner.org/optimize/instructionset.pdf

I have designed a consistent code structure where everything fits nicely. All instruction forms and addressing modes fit into the same template. All immediate constants and address offsets have power-of-2 sizes and proper alignment. The code word size is 32 bits, and each instruction can use one, two or three words. Each instruction can be coded in many different versions with different operand types, addressing modes, options and features. Simple common instructions can be packed in a tiny format with two tiny instructions stuffed into one 32-bit code word, but the 4-byte alignment of the code is maintained.
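Purely as an illustration of the packing idea - none of these field positions are from the published proposal - a decoder for such a packed pair might look like:

```c
#include <stdint.h>

/* Hypothetical encoding: assume the top nibble 0xF marks a 32-bit
 * code word holding two 14-bit tiny instructions. The marker value
 * and field widths are invented for this sketch. */
int decode_tiny_pair(uint32_t word, uint16_t *t1, uint16_t *t2) {
    if ((word >> 28) != 0xFu)
        return 0;                             /* not a tiny pair */
    *t1 = (uint16_t)(word & 0x3FFFu);         /* first tiny instruction  */
    *t2 = (uint16_t)((word >> 14) & 0x3FFFu); /* second tiny instruction */
    return 1;
}
```

Because both halves sit inside one aligned 32-bit word, the decoder never has to look across word boundaries to find instruction lengths, which is what keeps length detection simple.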

The idea of variable-length vector registers fits excellently with the design goals. The same executable program can run optimally on different microprocessors from small office computers and tablets to large scientific supercomputers with very long vector registers, without the need for separate compilation for each platform.

The instruction set has no name yet. I have considered calling it CRISC, because it combines the best from RISC and CISC. The modular format with easy detection of instruction length makes decoding simple and fast. Instructions have a moderate degree of complexity. An instruction can do multiple things, but only if it fits into the pipeline structure so that it does one thing at each pipeline stage. This will assure a throughput of one instruction per clock cycle per pipeline lane (except for division and cache misses). There is no need to split instructions into micro-operations or to use microcode. My ambition is to design a system that can outperform the best existing microprocessor designs.

My proposal includes standardization of the entire ecosystem of ABI standard, binary file format, function libraries, compiler support and OS support. With open standards for the entire ecosystem we would be able to combine different programming languages in the same program and use the same function libraries with all compilers.

Isn't that a nice vision? I am looking forward to your intelligent comments.


Author: Hubert Lamontagne Date: 2016-03-23 20:03
Really nice doc! A question and a couple suggestions:

I kinda wonder how the OS would handle the following case:
- Load program 1 to offset 0010 0000h
- Load program 2 to offset 0020 0000h
- Load program 3 to offset 0030 0000h
- The user unexpectedly loads a 300 MB file in program 2, which causes a surprise 300 MB allocation on the heap. Where does the OS place this allocation?

--------

One suggestion: remove "Indexing into the register file".

Rationale:

The implementation of register file indexing on an out-of-order CPU looks like this:

- Option #1: Stall hard at the register rename stage until the indexing value becomes valid and can be read. Then read the value through a special feedback path from the register file/bypass network/data cache up to the renamer. If the index is loaded from memory, this will stall for at least 3 or 4 cycles even in the best case. Resume execution.

- Option #2: Speculatively assume that the indexed register will not be the same as any other subsequent operation. Record the previous value of every single rename in a special queue or register file after this until the indexing value can be read (through the special feedback path), or alternatively backup the whole renaming register file. When the indexing resolves, check that no reads/writes have been done to the resolved register. If there are any reads, trigger a branch prediction fail and restore the CPU state to the last known valid state.

- Option #3: Have a specialized predictor that remembers which indexed register reads/writes cause branch prediction fails. Use option #2 by default, except for indexed-register-reads/writes that trigger fails which use option #1.

Even on in-order CPUs this is a problem, because it defeats the bypass network for register reads - you can't bypass a register when you don't yet know which register it is! Having to stall multiple cycles to avoid potential hazards makes register-file indexing very costly, which defeats the purpose.

--------

2nd suggestion: Make operand size for tiny format 32bits, not 64bits (except mov). Consider trading tiny setbit/clearbit/xor/reversesubtract/andnot for arithmetic shift right and 64bit add/sub. Consider adding the option for 32bit signed indexes in address calculations or sign extending instead of zero extending 32bit operations to 64bit by default.

Rationale:

32bit integer operations are extremely common in C/C++ code, due to how the LLP64 (win64) and LP64 (OSX+Linux) models work. In a typical program, almost the only 64-bit values are pointers (plus the remaining cases where size_t is used instead of int for loop indexes), and typically the only operations done on pointers are add, sub and compare. This also affects indexing - some_pointer[signed_int32_offset] tends to show up very often in code, and you probably don't want the compilers to add a sign extension operation every time.
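A small C illustration of why the extension choice matters (helper names invented here): with a negative 32-bit index, sign extension and zero extension produce very different 64-bit address offsets.

```c
#include <stdint.h>

/* Illustrative helpers (not from the proposal): the 64-bit offset
 * used to address some_pointer[i] for a signed 32-bit index i. */
int64_t offset_signext(int32_t i) { return (int64_t)i; }            /* correct */
int64_t offset_zeroext(int32_t i) { return (int64_t)(uint32_t)i; }  /* wrong for i < 0 */
```

For i = -1 the sign-extended offset is -1, while zero extension yields 4294967295 - an address roughly 4 GB past the array.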


Author: Agner Date: 2016-03-24 01:46
Hubert Lamontagne wrote:
The user unexpectedly loads a 300mb file in program 2, which causes a surprise 300mb allocation on the heap. Where does the OS place this allocation?
The heap doesn't have to be contiguous - the stack does. In case the heap overflows, there are three options:
  • Allocate more heap space somewhere else. Make an extra entry in the memory map for it.
  • Same as 1. Use virtual address translation to keep it contiguous in order to make the heap manager simpler. This requires that you allocate a lot of unused virtual address space for each program as it starts. This method works also for the stack.
  • As a last resort, when memory space has become hopelessly fragmented and you are out of memory map entries. Swap the data of the least used program to disk. Reorganize the fragmented data by actually moving the values to make them contiguous in physical memory (assuming that they were already contiguous in virtual address space). This will cause an annoying delay to the user, but we already have such delays in current systems when you run out of memory. You may signal a warning to the user: Memory low, please close some programs.
One suggestion: remove "Indexing into the register file".
Thank you for pointing out the difficulties here. Several people have proposed things like a reconfigurable register space with arbitrary allocation of vectors in this space, and I thought that a register index was simpler.

I want to avoid complex instructions for saving and restoring all or many registers. Any suggestions for an alternative to writing 64 consecutive save instructions or having a complex microcoded save-all instruction?
2nd suggestion: Make operand size for tiny format 32bits, not 64bits (except mov). Consider trading tiny setbit/clearbit/xor/reversesubtract/andnot for arithmetic shift right and 64bit add/sub. Consider adding the option for 32bit signed indexes in address calculations or sign extending instead of zero extending 32bit operations to 64bit by default.
I am not sure if you are trying to save power by doing a 32-bit addition instead of 64 bits. The hardware may actually disable the upper part of the heavy carry-lookahead circuit if it can detect quickly that the values are small. Otherwise, I don't see the problem of using 64-bit addition on 32-bit values. The value will simply be truncated if it is later used in a 32-bit (not-tiny) instruction. The compiler will recognize the need to sign-extend a 32-bit signed index or optimize the program by replacing it with a 64-bit index variable. Or you may use an unsigned 32-bit index and use the carry flag for detecting end of loop if the index is counting down. I would rather keep array indexes 64-bit because setting a 2 GB size limit to arrays is a problem for the programming language standard, and the value may cross zero when a count-down loop ends.


Author: Hubert Lamontagne Date: 2016-03-24 12:20
1. Allocate more heap space somewhere else. Make an extra entry in the memory map for it.
That would work, although then the memory map would potentially need to be somewhat complex and probably have
to be cached - this is almost a TLB already, the only missing feature is the ability to remap memory blocks.
2. Same as 1. Use virtual address translation to keep it contiguous in order to make the heap manager simpler.
This requires that you allocate a lot of unused virtual address space for each program as it starts. This method
works also for the stack.
This is exactly what TLBs do!
3. As a last resort, when memory space has become hopelessly fragmented and you are out of memory map entries.
Swap the data of the least used program to disk. Reorganize the fragmented data by actually moving the values to
make them contiguous in physical memory (assuming that they were already contiguous in virtual address space).
This will cause an annoying delay to the user, but we already have such delays in current systems when you run out
of memory. You may signal a warning to the user: Memory low, please close some programs.
Amazingly, this approach exists on real hardware: pre-OSX Mac OS works this way. It has a memory compactor and
all memory allocations must use memory handles instead of pointers. When RAM runs out, the OS recopies the whole
RAM, removes all dead allocations, and updates all memory handle addresses. The reason for this is that the first
Macs had no MMU (!) so this was the only approach that could work, but this created long lasting damage to the
platform: there was a whole system for locking down handles (when doing system calls and so forth), and the whole
thing was only fixed with OSX. 16 bit Windows also has this issue (they figured out a way to fix it in Win32).

Java also typically has memory compaction and doesn't require an MMU, but it pays a price for it: it has unavoidable
garbage collector pauses, often taking upwards of 100ms (which is one of the reasons why Minecraft is a bit jerky).
Then again, Java is typically used for server software and enterprise/government stuff, where pauses don't really matter
as much as dealing with second rate programmers that can't use malloc()/free() properly.

------
I want to avoid complex instructions for saving and restoring all or many registers. Any suggestions for an alternative to
writing 64 consecutive save instructions or having a complex microcoded save-all instruction?
For general purpose registers, you're going to do 32 consecutive loads/stores so it's always going to take 'some' time.
I'd suggest mandating 16-byte stack alignment in the ABI, and having store-dual and load-dual instructions that can
save/restore two registers in a single whole 128bit memory access. This is what ARM64 does. ARM already has to deal
with other instructions with 3 sources or 2 destinations anyways.

Other RISCs simply use individual instructions for each register, because normally you do register saving/restoring
when doing branches and interrupts and other moderately slow things, so the pipeline is likely to stall for all sorts of
other reasons (branch prediction miss, data cache miss, instruction cache miss, chains of indirect loads, having all
downstream operations depend on a single load so that the whole pipeline is latency-limited, etc), so the cost of
saving/restoring the regfile is lost in the wash.

For vector register save/restore, you're dealing with large vector registers which is going to take multiple memory cycles,
so it almost doesn't matter how you do it! The cost of keeping the data cache busy for multiple cycles for each vector
register load/store completely dwarfs the cost of issuing 32 loads/stores instead of less. It's also possible to have
instructions that load/store to multiple contiguous SIMD registers - Arm NEON has tons of this, including strange stuff
like interleaved loads.

------
I am not sure if you are trying to save power by doing a 32-bit addition instead of 64 bits. The hardware may actually
disable the upper part of the heavy carry-lookahead circuit if it can detect quickly that the values are small. Otherwise,
I don't see the problem of using 64-bit addition on 32-bit values. The value will simply be truncated if it is later used
in a 32-bit (not-tiny) instruction. The compiler will recognize the need to sign-extend a 32-bit signed index or optimize
the program by replacing it with a 64-bit index variable. Or you may use an unsigned 32-bit index and use the carry
flag for detecting end of loop if the index is counting down. I would rather keep array indexes 64-bit because setting
a 2 GB size limit to arrays is a problem for the programming language standard, and the value may cross zero when
a count-down loop ends.
I'm not suggesting it for power-saving reasons, I'm suggesting it for C++ compatibility reasons!

You're right: 64bit MOV, ADD, SUB, AND, OR, XOR, SHL can stand in for 32bit versions. This does not work for SHR,
arithmetic SAR and comparisons though. MUL is a bit of a wash - 64bit MUL can stand in for 32bit MUL, BUT 64bit
MUL is kinda slow to implement in hardware due to the large number of bits involved and large number of partial sums,
so a 32bit MUL still makes sense. This is why ARM ended up with a 16x16->32 MUL in its instruction set - it's strictly
there because it can run faster than 32x32->32 MUL.
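This can be checked in C (illustrative helpers, not from the proposal): with stale upper bits in a 64-bit register, the low 32 bits of an addition still come out right, but a right shift leaks the stale bits into the low half.

```c
#include <stdint.h>

/* Perform a 64-bit operation and keep the low 32 bits, as a
 * 64-bit-only instruction set would for a 32-bit variable. */
uint32_t add_low32(uint64_t a, uint64_t b) { return (uint32_t)(a + b); } /* always matches */
uint32_t shr_low32(uint64_t a, unsigned n) { return (uint32_t)(a >> n); } /* can differ!   */
```

With reg = 0xFFFFFFFF80000000 (stale upper bits around the 32-bit value 0x80000000): add_low32(reg, 1) equals the true 32-bit result 0x80000001, but shr_low32(reg, 1) gives 0xC0000000 whereas a true 32-bit logical shift gives 0x40000000.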

The compiler will probably have to add an instruction to sign-extend all 32bit indexes. It cannot replace 32bit variables
with 64bit variables because that changes behavior - instead of 'int' being "32bits", it becomes "32bits except if the
compiler decided to make it 64bits but it can still revert to 32bits at any time depending on if the value is in a register
or in RAM". Technically you could make int 64bit, and some early 64bit cpus did this, but this is bad because it eats
twice as much data cache for no good reason.

IRL, >2GB arrays are as rare as hen's teeth and if they do happen, they probably already need complex management
for other reasons. This is not really due to constraints of current memory sizes... it's more of a result of the fact that
you practically never need "2 billion of something".


Author: Agner Date: 2016-03-24 14:50
Thanks for your comments. I know that I can't completely get rid of memory fragmentation and virtual address translation. I am just hoping to keep the number of memory sections and fragments so small that we can get rid of the fixed size memory pages and have a limited number of variable size memory blocks instead.

The instruction set makes sure that the data segment can be placed anywhere. It doesn't have to be adjacent to the code segment. Another thing is getting rid of the many DLLs or shared objects with each their code and data segment, but instead joining all DLL code segments into one.

Hubert Lamontagne wrote:
Any suggestions for an alternative to writing 64 consecutive save instructions or having a complex microcoded save-all instruction?
For general purpose registers, you're going to do 32 consecutive loads/stores so it's always going to take 'some' time.
I'd suggest mandating 16-byte stack alignment in the ABI, and having store-dual and load-dual instructions that can save/restore two registers in a single whole 128bit memory access.
I did consider 16-byte stack alignment for the sake of better alignment of vectors. But it will waste stack space on every call. Double push/pop instructions would have to be optional because I am trying to keep complexity down.

One solution is to have "save register and increment pointer" instructions in tiny form for both integer registers and full length vectors. Same for restore. Then you can save everything with 64 tiny instructions = 128 bytes of code.

About 32-bit index.
The compiler will probably have to add an instruction to sign-extend all 32bit indexes. It cannot replace 32bit variables with 64bit variables because that changes behavior - instead of 'int' being "32bits", it becomes "32bits except if the compiler decided to make it 64bits but it can still revert to 32bits at any time depending on if the value is in a register.
I just made an experiment to see what compilers actually do with a signed 32-bit index variable in 64-bit mode (x86-64). Both Ms and Gnu compilers simply replaced my 32-bit integer with a 64-bit integer. I don't think it is worth the effort to make a special addressing mode with sign-extended 32-bit index.
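A hedged reconstruction of such an experiment (the exact code used is not given in the post): a signed 32-bit index addressing an array in 64-bit mode. Compilers are free to keep the index in a 64-bit register because signed overflow is undefined behavior.

```c
#include <stdint.h>

/* Hypothetical test function: a signed 32-bit loop index used as an
 * array subscript. Compilers targeting x86-64 typically widen i to a
 * 64-bit register rather than sign-extending on every iteration. */
double sum_first_n(const double *p, int32_t n) {
    double s = 0.0;
    for (int32_t i = 0; i < n; i++)
        s += p[i];      /* p[i] requires a 64-bit address offset */
    return s;
}
```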


Author: Hubert Lamontagne Date: 2016-03-24 18:45
I am just hoping to keep the number of memory sections and fragments so small that we can get rid of the fixed size
memory pages and have a limited number of variable size memory blocks instead.
I'm really curious about how the hardware would be able to do that mapping in a very short time (ideally in a single
cycle, and in a way that the L1 cache data lines that get read don't depend on virtual address translation).

------

| Double push/pop instructions would have to be optional because I am trying to keep complexity down.

I feel that your proposal isn't that aggressive in terms of keeping complexity down (compared to a MIPS). It has many
other parts that would be more complex or would require microcode, so I'm not sure that the extra complexity in that
one place would be that salient.
One solution is to have "save register and increment pointer" instructions in tiny form for both integer registers and full
length vectors. Same for restore. Then you can save everything with 64 tiny instructions = 128 bytes of code.
Well, "load register and increment pointer" writes to 2 registers so it's similar in complexity to "load dual". Squeezing
a few very common types of memory loads in tiny instruction still makes sense though.

------
I just made an experiment to see what compilers actually do with a signed 32-bit index variable in 64-bit mode (x86-64).
Both Ms and Gnu compilers simply replaced my 32-bit integer with a 64-bit integer. I don't think it is worth the effort to
make a special addressing mode with sign-extended 32-bit index.
I just checked msvc's x64 output, and I admit I'm very spooked. It does stuff like loading an index into eax, then using
rax as an array index, even though the variable is clearly signed in the code. Some code locations seem to use movsx for
sign extension, but most don't. The compiler seems to guess when the variable can never go negative, somehow. In one
place, it seems to convert from eax to rax near the beginning of a function, then use rax throughout the body of the
function. Maybe it's using the part of the C/C++ standard that says that integer overflow is "undefined".

This also affects Java (which strongly defines "int" as 32bits and signed, and all its operations as 32bit signed operations)
and there is at least one paper about figuring out how to remove as many sign extensions as possible in 64bit mode.


Author: Agner Date: 2016-03-25 05:04
Hubert Lamontagne wrote:
I'm really curious about how the hardware would be able to do that mapping in a very short time (ideally in a single
cycle, and in a way that the L1 cache data lines that get read don't depend on virtual address translation).
The virtual address translation is just an adder in the memory map that I am envisaging, rather than the multi-level table lookup of a traditional TLB. You may put virtual address translation after the L1 cache to make cache access faster.

The memory map is saved and restored on a task switch since there will be a separate memory map for each process.
It has many other parts that would be more complex or would require microcode.
I hope not. I would rather have more complexity in the pipeline and perhaps dedicated state machines to things like interrupts and system calls rather than using microcode. Microcode seems to be incredibly slow in the processors I have tested, though I don't know exactly why.
Well, "load register and increment pointer" writes to 2 registers so it's similar in complexity to "load dual".
Pop dual will write 3 registers, including the stack pointer. I think a fixed limit of two output registers is fair. We need that for flags output anyway.
I just checked msvc's x64 output, and I'm admit I'm very spooked. It does stuff like loading an index into eax, then using rax as an array index, even though the variable is clearly signed in the code. Some code locations seem to use movsx for sign extension, but most don't.
In my experiment, the MS compiler sign-extended the index outside a loop, the Gnu compiler used zero-extension. Gcc is (in-)famous for interpreting standards in a very pedantic way. There is probably some C standard saying that a negative index to a pointer or array is undefined.


Author: Hubert Lamontagne Date: 2016-03-28 19:20
The multi-level TLB looks complex, but it has fewer corner cases than having an extra adder in the address translation: you can run the TLB lookup in parallel with the L1 cache lookup as long as each cache way is the same size as your page - this is why CPUs with a 2-way 8kb L1 (4kb per way) use 4kb MMU pages on x86!

For task switching, it's probably easier to change the master page table address (CR3 on x86) and clear the tlb than to do a real task switch. Or alternatively, you can have a tlb tagged per process and then you just need to change the page table base address + process ID register!

The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction. They even have a software tlb! That's also why it has the separate multiply result register - not for the high/low 32x32->64 thing but rather to avoid dealing with the really long latency result.

This is also why RISC-V is designed that way: the idea is that by having no multi result instructions, but having higher throughput because it's easier to make a CPU with out of order execution and lots of execution units, you still come ahead in overall speed.

Push/pop has the disadvantage of reupdating the stack pointer every time, instead of once for the whole group of loads/stores (the MIPS way). So I was suggesting dual load/store, not dual push/pop.

I think the voodoo GCC uses is that c++ specs say that overflowing integers are "undefined" - if you promote your int32s to int64 but never overflow them the result is the same!


Author: Agner Date: 2016-03-29 02:11
Hubert Lamontagne wrote:
The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction.
This is also why RISC-V is designed that way: the idea is that by having no multi result instructions, but having higher throughput because it's easier to make a CPU with out of order execution and lots of execution units, you still come ahead in overall speed.
Why is it so much more difficult to have two-result instructions? We need two-result instructions for: add-with-carry, overflow detection, pop register, return, read-and-increment-pointer, ALU-and-conditional-jump.
If we want an efficient implementation of add-with-carry with a single register, we may have an extra bit in the register for this purpose. Or we may use every second element in a vector for carry bit, at the cost of getting half the work done. The same for overflow detection.

Function calling also becomes complicated if we do not allow multiple results. The call of a non-leaf function will be: (1) copy instruction pointer to link register, (2) jump to target, (3) subtract from SP, (4) save link register on stack. And the return will be: (5) recall link register from stack, (6) add to SP, (7) jump to link register. That is seven instructions instead of two. Does the increase in speed really make up for that?


Author: Hubert Lamontagne Date: 2016-03-30 02:46
Agner wrote:
Hubert Lamontagne wrote:
The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction.
This is also why RISC-V is designed that way: the idea is that by having no multi result instructions, but having higher throughput because it's easier to make a CPU with out of order execution and lots of execution units, you still come ahead in overall speed.
Why is it so much more difficult to have two-result instructions?
I admit, it's not that much more difficult to have two-result instructions. But it has a cost: a 4-issue CPU ends
up potentially writing to 8 registers per cycle. This means that you need a register file with 8 write-ports, which
takes up more space and has a higher latency. Your register renamer also needs 8 write-ports instead of 4,
and the potential number of conflict scenarios that have to be broken down at the issue stage goes up. If you
have a pentium-pro style pipeline where results are committed to a permanent register file in-order, this also
goes up from 4 write ports to 8. If you have an R10000 style pipeline where the permanent register file is only
for renaming, then you have a different but similar problem: you have to queue 8 now-reusable registers to
the register renamer per cycle instead of 4.

You could also limit the number of instructions you issue if they take up too many write ports - for instance, if
you have 6 write ports, you can check on each cycle if you're going to use up too many write ports and only
issue 3 instructions to prevent the 7-or-8-writes-on-a-single-cycle scenario described above. But then you
need more arbitration circuitry and your potential benefit from multi-result instructions goes down.

I guess it all comes down to what's your limiting factor:
  • If your limiting factor is issue width (how many different instructions per cycle you can issue), then multi-result
    instructions are good because you're doing more work out of your few available instructions. x86 tends to
    fall in this case due to the whole instruction length business, and good x86 designs tend to do lots of work
    from few instructions (AMD's 3-issue Athlon is a perfect example of this - and a good example of "fast CISC").
  • If your limiting factor is register-file and rename ports, then multi-result instructions are bad because they
    won't be faster than multiple instruction sequences and they make the pipeline more complex overall (since
    the multiple results have to be committed together and so forth).
  • If your limiting factor is L1 cache read and write ports, then it's all a wash.
We need two-result instructions for: add-with-carry,
I don't think ADC is used often enough to warrant being included in a general purpose instruction set. It makes
perfect sense for 8bit and 16bit processors, but for 32bit processors, you rarely - if ever - do 64bit calculations.
This counts double for 64bit processors: ADC only ever appears if you want to do 128bit calculations (or larger),
which is even less common, and it only saves 2 instructions and 1 cycle latency over the equivalent MIPS sequence
(using compare-and-set-to-1-if-larger-or-equal to generate carry and adding it in separately).
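The flag-less sequence Hubert describes can be sketched in C (an illustrative function, not taken from either instruction set): the carry out of an unsigned addition is simply whether the sum is smaller than an operand, so a 128-bit add needs no two-result add-with-carry instruction.

```c
#include <stdint.h>

/* 128-bit addition from 64-bit halves without a flags register:
 * carry = (lo_sum < operand) is the compare-and-set equivalent of ADC. */
void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi) {
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;   /* 1 iff the low half wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}
```

Each limb costs one add, one compare-and-set, and one extra add - the two-instruction, one-cycle-latency overhead mentioned above.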
overflow detection,
Careful use of comparison instructions handle this case adequately, as far as I can tell. For instance, when doing
unsigned addition, you only have to compare the result with a source operand: if the result is smaller, you have a
100% guaranteed wrap. Or, since most integer operations happen on 32bit ints, you can generate an oversized
64bit result and check for overflow separately. Furthermore, these overflow checks happen outside of the critical
path, so unless the instruction stream is saturating the CPU's ALUs, these checks are basically free.
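Both checks can be written out in C (illustrative helper names):

```c
#include <stdint.h>

/* Unsigned wrap: after s = a + b, a wrap occurred iff s < a. */
int uadd_wrapped(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a;
}

/* Signed 32-bit overflow: compute an oversized 64-bit result and
 * check whether it survives truncation back to 32 bits. */
int sadd_overflowed(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a + b;
    return wide != (int32_t)wide;
}
```

Both checks consume the already-computed result, so they sit off the critical path exactly as described above.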

The other option is to generate trap interrupts on overflows, but this generally can't be used in high-level language
interpreters (too coarse grained, hard to recover from an interrupt), or in C++ (no support, programmers expect
int32_t calculations to wrap), and it's not particularly useful in ASM either (more speed-oriented than security-oriented).
pop register,
Ok, this one is actually fairly common in general purpose code. The MIPS equivalent is pretty okay too though: load
register + increment sp. This is especially OK if you pop multiple registers at the same time, in which case all the
sp increments can be combined together into one large increment at the end and the CPU doesn't have to keep
track of the intermediate values of sp. It generates a sequence like this:

Code:

ld r4 [sp + #0]
ld r5 [sp + #4]
ld r6 [sp + #8]
ld r7 [sp + #12]
add sp #16
On out-of-order CPUs, this might run faster than 4 pop instructions because it typically generates 5 micro-ops
instead of 8. Other times, the execution speed of that sequence is limited by L1 data cache ports so there's no
speed difference.

If you only have to pop a single value, then the splitting hurts more, but then your function is likely to be inlinable or
a leaf-function.
return
Return falls into more or less the same case as pop register: it's fairly common, enough to warrant special
consideration, but the MIPS sequence (ld r31 [sp + #x], add sp #4, jmp r31) also handles it well (since
loading/storing the link register can be combined with the rest of the loads/stores to stack so it often
doesn't generate any extra sp updates). Doing it with a single complex instruction rather than multiple simple
ones often doesn't generate much win since the number of overall memory operations and state changes
doesn't change ('return' afaik is always a 2 or 3 micro-op instruction on x86).
read-and-increment-pointer,
That case is similar to pop register: on one hand, you get two results for the price of one if your front-end can
only generate a limited number of instructions (which is why ARM has this instruction), but on the other hand,
separating reading and pointer-increment lets you combine a whole bunch of updates to the same pointer
together, which is often good since it reduces the number of intermediary results for the pointer value (and
removes the false dependency between multiple consecutive reads to the same incremented pointer).

If you look at later ARM cpus, read-and-increment instructions often have speed penalties since the
underlying CPU can't really handle the extra generated values (too few register write ports etc) so there's
often very little gain over using separate load and increment instructions.
ALU-and-conditional-jump.
That one is more interesting because it's not really a multiple-result instruction... The ALU result and jump
go to different parts of the retirement unit (regfile writeback and branching respectively), which is why
combined compare+branch appear in many RISC instruction sets (including MIPS).
If we want an efficient implementation of add-with-carry with a single register, we may have an extra bit in the register for this purpose. Or we may use every second element in a vector for carry bit, at the cost of getting half the work done. The same for overflow detection.

Function calling also becomes complicated if we do not allow multiple results. The call of a non-leaf function will be: (1) copy instruction pointer to link register, (2) jump to target, (3) subtract from SP, (4) save link register on stack. And the return will be: (5) recall link register from stack, (6) add to SP, (7) jump to link register. That is seven instructions instead of two. Does the increase in speed really make up for that?
Steps (1) and (2) are typically a single instruction (since one retires to the regfile and one retires to the
branch unit), so it's not a problem. Function calls are likely to generate a whole bunch of memory
loads/stores - typically to object member variables in C++ - so it's very likely there will be a free pipeline
ALU slot for the SP updates (3) and (6), and as stated above you only need a single SP update no
matter how many registers you're loading/storing. If a cache miss or branch misprediction happens
(which is probably most likely near function starts and ends), then you'll likely have dozens of free ALU
cycles that you can use to deal with SP. Another common case is that function calls/returns often have
series of instructions limited by data cache latency (ex: loading a pointer, then using it to read from RAM),
in which case you also have many 'free' ALU cycles.


Author: Agner Date: 2016-03-30 11:31
Hubert Lamontagne wrote:
it all comes down to what's your limiting factor:
- If your limiting factor is issue width (how many different instructions per cycle you can issue), then multi-result
instructions are good because you're doing more work out of your few available instructions. x86 tends to
fall in this case due to the whole instruction length business, and good x86 designs tend to do lots of work
from few instructions (AMD's 3-issue Athlon is a perfect example of this - and a good example of "fast CISC").
- If your limiting factor is register-file and rename ports, then multi-result instructions are bad because they
won't be faster than multiple instruction sequences and they make the pipeline more complex overall (since
the multiple results have to be committed together and so forth).
- If your limiting factor is L1 cache read and write ports, then it's all a wash.
I believe the limiting factor will most likely be cache bandwidth and memory bandwidth. That's why I want moderately complex and compact instructions.
We need two-result instructions for: add-with-carry,
I don't think ADC is used often enough to warrant being included in a general purpose instruction set.
add-with-carry is used mainly for high-precision math. This is typically big number-crunching algorithms where performance is critical.
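As a flagless baseline for the high-precision math Agner mentions, here is a minimal C sketch (the 4-word width and function name are mine, not from the thread) of multi-word addition when the carry must be recomputed with comparisons instead of read from a carry flag:

```c
#include <stdint.h>

/* 256-bit unsigned addition without an add-with-carry instruction:
 * each word needs an add, a compare-derived carry, a second add, a
 * second compare, and an OR - roughly five operations where an ADC
 * instruction does the whole step in one. */
void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a[i] + b[i];       /* wraps modulo 2^64 */
        uint64_t c1 = s < a[i];         /* carry out of a[i] + b[i] */
        r[i] = s + carry;
        uint64_t c2 = r[i] < s;         /* carry out of adding the old carry */
        carry = c1 | c2;                /* at most one of the two can be set */
    }
}
```

The comparison trick works because unsigned addition wraps: the sum is smaller than an operand exactly when a carry occurred.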
overflow detection,
Careful use of comparison instructions handles this case adequately, as far as I can tell. For instance, when doing unsigned addition, you only have to compare the result with a source operand.
Overflow detection with signed integers is a mess.
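The asymmetry between the two cases can be made concrete (helper names are mine, not from the thread): the unsigned check is the single comparison Hubert describes, while the signed check needs sign-bit algebra and a detour through unsigned arithmetic, because signed overflow is itself undefined behavior in C.

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned addition overflows iff the wrapped result is smaller than
 * one of the operands: a single compare against a source operand. */
bool uadd_overflows(uint64_t a, uint64_t b) {
    return a + b < a;
}

/* Signed addition overflows iff both operands have the same sign and
 * the result's sign differs from theirs. Computed on unsigned copies
 * so the wrap itself is well-defined. */
bool sadd_overflows(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;
    return (int64_t)(~(ua ^ ub) & (ua ^ sum)) < 0;
}
```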
pop register,
Ok, this one is actually fairly common in general purpose code.
Well, I have changed my opinion on this one :-)
Push and pop will rarely be used in critical parts of the code if my ABI proposals are met. We don't need push for function parameters because we have assigned 16 integer registers and 16 FP/vector registers for function parameters. We could even assign more registers, if needed. Push and pop for register spilling can be kept out of the critical innermost loops and functions if we implement my idea that the register use of functions should be reported in object files.

The problem with push and pop is that a push or pop that is waiting for an operand can delay all subsequent stack operations unless the instruction is split into micro-operations or a special stack prediction mechanism is implemented. So I think that we don't need push and pop instructions at all, but I still want to have call and return instructions that use the stack (for the reasons I have explained in my document). A direct call and return cannot delay subsequent stack operations, but an indirect call can, if it is not split into micro-operations.
read-and-increment-pointer,
That case is similar to pop register: on one hand, you get two results for the price of one if your front-end can
only generate a limited number of instructions (which is why ARM has this instruction), but on the other hand,
separating reading and pointer-increment lets you combine a whole bunch of updates to the same pointer
together, which is often good since it reduces the number of intermediary results for the pointer value
The only reason I want read-and-increment-pointer is to make it easy to save all registers. A "read-and-increment-pointer" can fit into a tiny instruction, while you will need a doubleword-size (64-bit) instruction to save or restore integer registers with a non-moving pointer, and you will need two instructions to save or restore each variable-size vector register. So this is for compactness, not for speed. It may be bad for speed if a compiler uses the read-and-increment-pointer instruction for ordinary loops, unless the instruction is split into micro-operations. Other forms of loop will be faster than read-and-increment-pointer.
