The original proposal, 2 years ago

Post by **agner** » 2017-11-02, 14:06:00

Author: Joe Duarte Date: 2017-04-13 20:12
FYI, Intel has an upcoming solution for protecting return addresses: https://software.intel.com/sites/defaul ... review.pdf

It would be interesting to compare ForwardCom's approach to CET.

Author: Agner Date: 2017-04-14 00:53
Joe Duarte wrote:

Intel has an upcoming solution for protecting return addresses: https://software.intel.com/sites/defaul ... review.pdf

It would be interesting to compare ForwardCom's approach to CET.

Thanks for the info. Intel's solution is quite complicated because it has to be compatible with legacy code and legacy processors. ForwardCom's solution is quite simple: A separate call stack which stays in the CPU most of the time, and jump tables placed in read-only memory. This just shows the difference between patching an insecure system and designing a system to be secure from the beginning.

Post by **agner** » 2017-11-02, 14:07:06

Author: Agner Date: 2017-07-27 02:06
I am working on making an assembler for ForwardCom, and I think that the design of assemblers could use an update, now that we are redesigning everything anyway. I found that it was not too hard to make the assembly syntax look very much like C or Java, rather than the terrible syntaxes of many current assemblers. This will surely make it easier to write assembly code and make the code easier to read. My plans for the assembler so far are as follows:

Instructions look like C code. For example to add two registers:

Code: Select all

int32 r1 = add(r2, r3)

Or simply:

int32 r1 = r2 + r3

Memory operands are indicated with square brackets, where the length of vector registers is indicated if necessary:

Code: Select all

float v1 = [r1 + 0x100, length = r5] // read memory operand of r5 bytes into vector register v1

Masked instructions can be coded conveniently with the ?: operator:

Code: Select all

int32 r1 = r4 ? r2 + r3 : r5 // add r2 and r3 with boolean in r4 as mask and r5 as fallback value

This is still assembly, because you make one machine instruction per line, and the programmer has to decide which registers to use, but it looks like high-level language.

Branches and loops can be coded either with conditional jumps or as high-level constructs like if, else, switch, for, while, do, break, continue, with {} for block nesting.

I also intend to support preprocessing directives like #define, #include, #if; #else, #endif.

Now, a more tricky problem is macros and metaprogramming. Most assemblers have macro features that allow metaprogramming, but the syntax is often complicated, confusing, and inconsistent. Such features are probably needed, but I would like a more logical syntax. I have looked at high-level languages to see if they have useful metaprogramming features. Metaprogramming in C++ is possible with templates, but this is very convoluted because the template feature was designed for a more narrow purpose. I would prefer something more similar to the support for self-modifying code in script languages. You make a string variable and then issue this string to the assembler and have it interpreted as assembly language.

We need a way to distinguish between code that is generated for run-time execution, and meta-code that is executed in the assembler. My idea so far is to use a dot (.) or percent (%) or some other symbol to indicate assembly-time meta code. For example, a loop that adds the numbers from 1 to 10 at runtime will look like this:

Code: Select all

int64 r0 = 0
for (int64 r1 = 1; r1 <= 10; r1++) {
int64 r0 += r1
}

A similar assemble-time loop could look like this:

Code: Select all

. sum = 0 // define assembly-time variable sum
.for (i = 0; i <= 10; i++) { // dot indicates assembly-time loop
. sum += i
}
int64 r1 = sum // runtime code r1 = 55

This will calculate the sum at assembly time and just generate a single runtime instruction setting register r1 = 55.

The assembly-time variables will be dynamically typed, using 64-bit integers, double, or string, depending on the context. All common operators are supported. String expressions can be converted to assembly code. For example:

Code: Select all

. text = " nop \n"
. emit text + text /* This will send the string " nop \n nop \n" to the assembler and produce two nop instructions */

With this proposal, we have three kinds of code. For example, "#if" will branch at the preprocessing stage, ".if" will branch at the assembly metaprogramming stage, and "if" will generate executable code that branches at runtime. It may be confusing with three types of code, but at least the "#" and "." prefixes will make the distinction clear.

The preprocessing directives (with #) are executed before the assembler interprets any code, so it has no knowledge of anything defined in the code, while the metaprogramming features (with .) can check e.g. if a function is defined before making code that calls it, or it can make different code depending on the type or size of a defined data object.

I would like to hear your comments before I implement all this.

Author: Kai Rese Date: 2017-08-11 09:13
This looks like the most convenient and easiest to use assembler I have ever seen. I'm really looking forward to playing around with it.

Are there reasons why the preprocessing stage and the metaprogramming stage can't be merged into one stage?
This would reduce the assembler to only two kinds of code, but I don't know if there would be complications.

Author: Agner Date: 2017-08-11 10:00
Kai Rese wrote:

Are there reasons why the preprocessing stage and the metaprogramming stage can't be merged into one stage?

The preprocessor has no knowledge of anything defined in the code. For example, it irritates me that you cannot do something like
#if defined(myFunction)
or
#if sizeof(myStructure) > 8
in C or C++.

The need for metaprogramming is higher in an assembler than in a compiler, and it would be nice to have metaprogramming features that can query information about defined symbols.

I think it is convenient to have preprocessing directives like #include and #if resolved before the assembler begins to interpret anything else because programmers are used to this concept and it is completely transparent what happens. If we would use some kind of metaprogramming for the same purpose then it becomes less obvious at what stage in the process these directives are executed.

I have encountered a problem with this strategy, though. Consider the following code, where % means define an assemble-time variable:

%A = R1
%B = R2
A = B

This is ambiguous. It could generate the runtime instruction R1 = R2, or it could change the assemble-time variable A to R2. A possible solution to this dilemma is to define a symbol for alias names. For example:
& parameter1 = R0
could define parameter1 as an alias for register R0. This would be useful for function parameters and local variables, but it would generate further confusion with three different prefixes: #, %, &

Any better ideas?

Author: Kai Rese Date: 2017-08-14 22:33
Agner wrote:

I think it is convenient to have preprocessing directives like #include and #if resolved before the assembler begins to interpret anything else because programmers are used to this concept and it is completely transparent what happens. If we would use some kind of metaprogramming for the same purpose then it becomes less obvious at what stage in the process these directives are executed.

This is a valid point, I still have to think about that.

My idea was to use metaprogramming as an extension to the preprocessor. For example, variables defined on meta-level could be treated as preprocessor-macros by the actual assembler. This policy also solves the ambiguity problem:
Agner wrote:

Code: Select all

    I have encountered a problem with this strategy, though. Consider the following code, where % means define an assemble-time variable:

    %A = R1
    %B = R2
    A = B
    This is ambiguous. It could generate the runtime instruction R1 = R2, or it could change the assemble-time variable A to R2.

If the symbol always gets set in the meta-code and always gets inserted in the assembler code, this example would always produce:
R1 = R2
We usually do not know the content of a register at assemble time. For that reason, we should be able to automaticly make a symbol an alias when defining with a register. Function arguments should also be able to be defined as aliases automaticly.

Agner wrote:

A possible solution to this dilemma is to define a symbol for alias names. For example:
& parameter1 = R0
could define parameter1 as an alias for register R0. This would be useful for function parameters and local variables, but it would generate further confusion with three different prefixes: #, %, &

This is a trade-off that has to be made. How many features do we need and how much complexity is still easy enough to use?
My vision was usage of the metaprogramming-features in a template-like fashion. For example, this is a template for an engine written in python, called mako. It outputs html-code:

Code: Select all

<%inherit file="base.html"/>
<%
    rows = [[v for v in range(0,10)] for row in range(0,10)]
%>
<table>
    % for row in rows:
        ${makerow(row)}
    % endfor
</table>
<%def name="makerow(row)">
    <tr>
    % for name in row:
        <td>${name}</td>\
    % endfor
    </tr>
</%def>

My suggestion is aiming for a fairly simple code, maybe at the price of a more complex interpretation. I really don't know if this would be a better way than the current way of using different prefixes. It could also cause more problems with debugging, as you mentioned. There also might be many other problems I'm currently not seeing, but it is an idea nonetheless.

Author: Agner Date: 2017-08-14 23:44
Kai Rese wrote:

If the symbol always gets set in the meta-code and always gets inserted in the assembler code, this example would always produce:
R1 = R2

The meta-code still needs access to its own variables, for example a loop counter. Perhaps we can use a percent sign to indicate that a meta-variable is set or modified.

%A = R1
%B = R2
A = B // produces R1 = R2
%A = B // modifies A to equal R2

Author: Kai Rese Date: 2017-08-15 21:04

With what we have so far, it should work like this:

Code: Select all

%A = 0 // sets A as value
%B = A + 5 // allowed if A is a value

%A = R1 // sets A as alias
%B = A + 5 // invalid as A currently is an alias

%A = R1 + 5 // could be allowed as complex symbol, could also get messy

int64 R2 = A // allowed if A is defined, could be a move or a set instruction
int64 R2 = A + 5 // allowed if A is a value, otherwise not allowed

A = R2 // allowed if A is an alias, otherwise not allowed

I used the percent sign as indicator for metacode. I prefer the percent sign over dots because dots are fairly easy to overlook.
We also could make an alias require a special or different sign. It might benefit functionality, but could lead to more confusing code.

Post by **agner** » 2017-11-02, 14:16:00

Author: Hubert Lamontagne Date: 2017-08-22 12:46
I've taken another quick look today at the specs to figure out how feasible a Verilog implementation running on a FPGA would be, and I'm finding that register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer).

So I'm wondering, how many ports can modern register files manage, how does it scale with chip size and frequency, and how many ports are realistic on a modern mid-sized FPGA? (like, a 30$ FPGA or something?)

Author: Agner Date: 2017-08-23 00:37
Hubert Lamontagne wrote:

register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer).

The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.

I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.

Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.

My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.

Why do you think register read ports are so expensive?

Author: Hubert Lamontagne Date: 2017-08-27 10:12
Agner wrote:

The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.

I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.

Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.

My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.

Why do you think register read ports are so expensive?

I think this is one of those kind of things where it really depends on which tech you're aiming... Intel and AMD have huge register files with tons of ports on their modern cores. That probably takes up tons of silicon and suck out tons of power (and is probably hard to get running at very high clock rates). But that kind of chip is your target, so it's appropriate. Come to think of it, you're reminding me of the whole segment thing, and that's making me quite curious as to how x86 cores track that - it's the kind of thing you'd possibly want to keep in a separate register file and give it its own register renamer to reduce the level of contention with the rest of the registers.

On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
Here's a paper on this.

As for RISC-V, they're obviously targeting smaller implementations (kindof like ARM), which are often in-order with 1 or 2 instructions per cycle. At that performance point (and in small out-of-order CPUs), this MIPS-inspired design is pretty much the optimal design (along with ARM-style), as far as I can tell. Though you do have a point, that they might be overdoing simplicity by not having R+R*n addressing modes - although the performance hit of that probably varies a lot per application. I think a very large proportion of memory reads in non vector code are stack pointer or object pointer + offset, which probably reduces the impact of lacking more complex addressing modes.

One thing ARM has done is to keep the number of input dependencies per uop low at 2 for the integer part of the core (these ops HAVE to run at low latency), and let vector ops have 4 inputs (you can see the cost of this if you look at result latency for NEON ops - often 3+ cycles!).

Author: Agner Date: 2017-08-28 01:35
Hubert Lamontagne wrote:

On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.

What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Author: Bigos Date: 2017-08-28 15:36
Agner wrote:

Hubert Lamontagne wrote:
On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

I think this is how most x86 architectures worked till Sandy Bridge on Intel side and Bobcat/Bulldozer on AMD side. Later they have switched to PRF-like architecture (Physical Register File) where data only moves where necessary and is not tied to the pipeline at the early stages. The reason for that was power - moving so much data (especially with 256-bit vectors introduced in Sandy Bridge) took a lot of it. There were also read ports issues on P6 architectures till Sandy Bridge that you mentioned in your microarchitecture paper which were probably fixed just because of that - meaning it was probably not scalable enough.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

Unless each vector pipeline uses the same register names for each part of the same bigger register, you would have to make these whole-register operations take N times as many sources (where N is the number of 64-bit pipelines), which would probably we an overkill for a scheduler.

Also, I don't see what would be the benefit of such a partitioning? Of course, some level of partitioning is needed (for example for power-gating unused data paths or separating some per-line units like adders), but that is physical (you could split it for each bit if you wanted); I don't think it would help the register files much - you still need as many read and write ports.

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Also, about the register file, the most problematic are the write ports. You can always double the number of read ports by duplicating the register file and having each write take place in both copies; the scaling is less than linear due to the writes being more complicated (have to drive more inputs), but is mostly fine. Minimizing the number of write ports is the most important thing.

You can also partition your register space into multiple classes, each of these having a different register file (Intel does it with the kN mask registers). This effectively resets your register file port count, unless you have to duplicate all of the ports in order to support all of the possible register data transfers at the same time. You can, for example, partition the scalar and vector registers into different files (like AMD does) in order to increase the maximum possible throughput if you use both scalar and vector instructions at the same time (Zen can handle up to 10 instructions partially thanks to that, though only for short bursts due to different bottlenecks elsewhere in the pipeline; Intel can only do 7, not counting the separate store port)

BTW, a similar partitioning scheme can be used for caches (the most obvious being I$ vs D$). For example, the Mill architecture has 2 different I$ partitions - one for instruction chunks going forward and another for the ones going backward. Each jump target is really a middle of code and each instruction is halved, with half going forward and half going backward. Since each instruction half will only ever go to a single cache, you effectively double the cache capacity without compromising the latency and complexity (just 2x area and power for having twice as much memory).

https://millcomputing.com/docs/encoding/

Author: Hubert Lamontagne Date: 2017-08-28 17:05
Agner wrote:

Hubert Lamontagne wrote:
On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to.
What do you think of an implementation like this:

There is a permanent register file, but no temporary register files. Temporary register values are stored instead in the instructions that need them in the reservation station. Each instruction has its register operands replaced by their values immediately after the decode stage if the values are available in the permanent register file. Register operands that are not yet available are replaced by the tag of the pending instruction that will produce the value. Each ALU has a write port that writes the result to all pending instructions that needs it, as well as to the permanent register file. ForwardCom does not allow instructions with more than one output value.

So we will need four or five register read ports in the early stage in the pipeline for reading the permanent register file, but no read ports for a huge temporary register file.

Hmm, interesting. I think I've read that some Intel P2 family cores work something like this somewhere. Though I'm a bit worried that this might make the Reservation Station rather big, since each input operand of each instruction in the Reservation Station has to watch all the write ports all the time and multiplex it in... So your reservation station entry needs a register per operand, plus a multiplexer for each operand so that the appropriate result can be saved. This is kindof moving the hugely-multiple-write-ported register file into the reservation station.

I think in that case your permanent register file also still needs about as many ports as in a design where the same register file is used both for permanent and temporary values.

It might make sense to go for a design where registers are renamed to different physical registers depending on which unit produces them. For instance, ALUs and memory load units could have a different physical register pool internally (even though they look the same to programmers) so that they don't have to share write ports, and the register renamer can relatively easily keep track of what's in which pool. Generally for FPGAs, register files with many load ports and a single write port tend to be the better plan - the Verilog tool simply creates as many register file copies as it needs.

Vectors are implemented as follows. The pipeline is split into 64-bit lanes. Each lane has its own 64-bit register file, reservation station, pipeline, and 64-bit ALU. We can add as many 64-bit lanes as we need to support a certain vector length. There is one scheduler which controls all the lanes. This can run smoothly and fast as long as there is no exchange of data between the lanes. Instructions that move data between the lanes, such as broadcast and permute, are implemented in a separate unit with a single bidirectional data path to all the lanes. Such instructions will take one or two clock cycles extra, as they do in x86 processors.

Quite sensible and I'd bet dollars to pennies that most SIMD units are built this way ;3

The in-order front end has its own ALU's which can handle simple integer instructions. Simple instructions on general purpose registers are executed here if their operands are available (as I have proposed in a previous post). Loop conters and simple branches will be executed in the front end so that the instructions of the loop body are sent to the out-of-order back end with the loop counter replaced by its value. This will reduce the branch misprediction penalty. Registers used for memory pointers and indexes will also mostly be resolved in the front end so that the memory operands can be fetched early.

Hmm, I wonder if it's possible to track which instructions/operands/values are "front end" and tag them in the instruction cache, so that they can be separately issued? Using some kind of heuristic like "result used in address computation" or "result not coming out of memory load/sufficiently stale by the time it gets used". Like, a front/back-endness predictor ;3

Post by **agner** » 2017-11-02, 14:20:39

Author: yeengief Date: 2017-09-20 14:03
Re variable instruction lengths: Consider some forced alignment, e.g. every 16-byte (or every 32-byte) boundary MUST begin a new instruction. This has minimal impact on the effective code density (if your compiler is not stupid), while allowing almost random access decoding of instructions. This is good for the processor (short dependency chains with respect to instruction decoding/parsing, as opposed to infinite), and even more important for reverse engineering software, debuggers, etc. Even better: Declare that only 16-byte aligned addresses are valid jump targets.

Effect: You can decode a huge binary, and be sure that all valid jump targets have been found. It is easy to check whether some instruction is contained in the code or not: use grep. Your compiler should be able to ensure alignment without issuing too many NOPs.

Enforce "write XOR execute". This has a lot of advantages. Firstly, security-- but this should rather be the application programmer's job.

Secondly, simplified cache coherency: instructions and data use separate caches, and attempting to ensure that a write to an instruction already in the pipeline does correctly roll back... yeah. On ARM you need to explicitly flush the instruction cache for this to work.

Third, and the biggest advantage: Consider QEMU. Emulating ARM is a breeze: you KNOW where code is located, and every jump/instruction cache flush triggers software decoding, translation to e.g. x86 and optimization (LLVM), caching of the result, etc, and then continuation.

Do not try this with x86. You don't know where instructions start, you don't know whether the code is self-modifying, everything sucks.

Btw: self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

Indeed, I imagine that your ISA could first find adoption as an intermediate representation (like llvm IR)-- if you ensure that it is very fast to jit/compile from your language to fast x86 and arm, and pretty fast to verify properties of code.

Bonus request: Performance-hint/NOP: Allow the instruction to hint at likely-ness, often the compiler/programmer can predict the branch pretty well.

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there. Or maybe provide a separate magic memory location for this, if the stack would clash too much with calling conventions.

Desired effect: The compiler/programmer is limited by the number of addressable registers. Often, you will need to temporarily store stuff you need later. In reality, more registers exist. Allowing the processor to avoid hitting the L1-cache and using a renamed/anonymous register should give huge performance boosts in future processors with more anonymous registers running legacy code.

At the same time, this makes dependency analysis (in software and hardware) easier.

Bonus request: Hardware support for constant-time operations. It turns out that this is quite important in a lot of crypto protocols (timing side-channels). I am not entirely sure how to best help here, but if it was possible to get an accurate cycle counter, and a NOP-until-cycle-xyz instruction, this would probably be supremely useful. Both instruction (count-cycle and wait-until-cycle) are allowed to take very long: these are rare instructions, and it is totally fine to take 10000 a cycle hit! On the other hand, these would also simplify a lot of debugging and performance testing.

Re floating point exceptions: I think control registers, trapping and fault-handling are the right way to go for "extremely unlikely" branches. This way, you basically have a conditional branch on every memory access or floating point operation (page fault?), which has a very high cost for mispredictions, and is very cheap on correct predictions (effectively zero overhead).

Author: Agner Date: 2017-09-20 15:23
Thanks for your thoughtful suggestions. This project may become a sandbox for many experiments and research on such ideas for more efficient computer systems.

yeengief wrote:

Re variable instruction lengths: Consider some forced alignment, e.g. every 16-byte (or every 32-byte) boundary MUST begin a new instruction. This has minimal impact on the effective code density (if your compiler is not stupid), while allowing almost random access decoding of instructions. This is good for the processor (short dependency chains with respect to instruction decoding/parsing, as opposed to infinite), and even more important for reverse engineering software, debuggers,

Currently, the ForwardCom ISA allows three different instruction lengths: 1, 2, and 3 words of 32 bits each. Decoding is easy because the instruction length is determined by only two bits. Forced alignment is applied only to tiny instructions of half-word size, which must be packed two-by two at word boundaries, and you can't have a jump target in between. The problems you are trying to solve certainly exist in the x86 world where it is very complicated to determine the length of an instruction, and some compilers put data in the code segment (i.e. jump tables). But I don't see any big problems in ForwardCom. I found that variable instruction length is important because it is difficult for the compiler to predict whether the distance for a relative jump will fit into a certain instruction size. It is easier to leave it to the assembler to find the optimal size of each instruction.

Regarding reverse engineering, I have allready made a disassembler. It works so well that the code can be re-assembled. I am soon finished making a high-level assembler. The syntax resembles C, but the programmer has to select the individual instructions and registers. I will put it on github later this autumn.

Enforce "write XOR execute".

Of course. Executable code is in execute-only sections by default. Execute-and-read access is possible, but write acces should not be allowed. Jump tables and other code pointers are in read-only sections. Self-modifying code is not allowed.

Bonus request: Performance-hint/NOP: Allow the instruction to hint at likely-ness, often the compiler/programmer can predict the branch pretty well.

Branch prediction matters only for branches that are executed millions of times. Static branch prediction affects only the first few executions of the millions, while dynamic branch prediction affects them all. If I understand you right, the likely-ness hint is a form of static branch prediction, so I don't see the point. The compiler may place the often-executed branches or functions in a hot section and the rarely executed branches, such as error handling, in a cold section that rarely pollutes the cache. I see this as a pure compiler issue which does not affect the ISA design.

The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

Each thread has private stack memory by default which is not accessible to other threads, except when legacy software requires otherwise. I am not sure I understand you, but maybe this is what you mean?

Hardware support for constant-time operations. It turns out that this is quite important in a lot of crypto protocols

Funny that you should mention this. I got the same idea a week ago. The next version will have optional support for a mask bit that dictates that execution time should be independent on the values of operands. This does not necessarily imply that execution time is predictable, only that it is independent of the values of sensitive data. Branches can be replaced by predicated instructions, and small lookup tables can be replaced by vector extract or vector permutation instructions.

Re floating point exceptions: I think control registers, trapping and fault-handling are the right way to go for "extremely unlikely" branches. This way, you basically have a conditional branch on every memory access or floating point operation (page fault?), which has a very high cost for mispredictions, and is very cheap on correct predictions (effectively zero overhead).

Fault trapping can be turned on or off for both floating point errors and integer overflow. Only problem is that the behavior depends on the vector length if an error occurs in a single vector element. NAN propagation is offered as an alternative if you want to trace errors to a single vector element and you want identical behavior on processors with different vector length.

Author: yeengief Date: 2017-09-20 16:22
>Currently, the ForwardCom ISA allows three different instruction lengths: 1, 2, and 3 words of 32 bits each. Decoding is easy because the instruction length is determined by only two bits.

The first problem with variable instruction lengths is that code has, in worst case, three different interpretations, depending on the entrypoint. I agree that this price is likely worth paying for the increased code density.

My problem is the long-range dependency of the correct decoding, depending on the entrypoint. Without restrictions, this dependency is effectively infinite-length. Q: what has x86 code and DNA in common? A: You need a hidden markov model to guess at the decoding if you don't know the entrypoints. This is bad, I don't want to care about BLAS when writing a disassembler.

To take a larger number: Suppose you say that every aligned 16 word block (nice, cache-line!) MUST begin a new instruction. Now, the length of the dependency of the "correct reading frame" for your code has shrunk down to one cache line.

What does this cost us? The very worst case is that we need to pad two NOPs when trying to emit a 3-word instruction near the end of a line. In other words, the code becomes 2/16 = 12.5% less dense, if the compiler can do no re-orderings at all. Code where every instruction is dependent on the previous one will be dog-slow anyway, so meh.

My maybe too strict initial proposal of cutting at 4 words, would give a maximal overhead of 50%, i.e. double the code length-- for code where no reordering is possible, and the lengths are maximally bad. Since instructions of length one are most frequent, and most of the time you have a little bit of freedom to reorder, this should not occur too often.

Restricting jump targets to certain alignments should likewise have only limited effect on code-length, except for the case where we have tiny tiny loops. Unfortunately, only allowing jumps to beginnings of cache-lines is probably too restrictive, otherwise we could get unambiguous decoding. This would simplify a lot of stuff, e.g. because the processor designer could decide to dynamically translate code into some internal uops, unambiguously and cached and independent of actual execution flow. In other words: We would need to decode instructions only on mapping pages as executable.

Also, it would simplify verification of code properties. For example, NaCL (google native client) uses such alignment restrictions to make the decoding of NaCL-compliant code statically unambiguous.

Execute-and-read access is possible, but write access should not be allowed.

Now that you mention it, why not go full Harvard? Would it be really bad to disallow read access to executable pages? Again, the idea is dynamic translation: the original code you want to read might have been swapped out to tape years ago, and if you really need access, just map it into a different page.

Sidenote: Apple's iOS is almost Harvard. Mapping a page to executable triggers code signature checks and loads of crypto, and removes writable. I think they still allow read access, though.

Author: Agner Date: 2017-09-20 23:52
yeengief wrote:

the processor designer could decide to dynamically translate code into some internal uops, unambiguously and cached and independent of actual execution flow.

Many x86 processors now have a micro-op cache because the decoding of instructions is a bottleneck. The ForwardCom ISA is designed so that decoding is very simple, never requiring more than one pipeline stage or one clock cycle. My idea is that a micro-op cache should never be needed. It should be easy to decode several consecutive instructions in parallel.

Regarding Harvard architecture, I imagine that there will be a code cache and a data cache, and perhaps a third cache for read-only data sections. But the hardware designer is of course free to choose some other configuration. The call stack and data stack are separate. The call stack will be so small that it can fit into its own little cache throughout the execution of a program, except for deeply recursive functions.

Author: yeengief Date: 2017-09-21 07:54

Many x86 processors now have a micro-op cache because the decoding of instructions is a bottleneck. The ForwardCom ISA is designed so that decoding is very simple, never requiring more than one pipeline stage or one clock cycle.

Yes, but decoding is still sequential and not parallel. My proposal makes decoding ridiculously parallel.

New number proposal: 8 words, a half cache-line. The beginning of every 8 word block must begin a new instruction. Only the beginning of 8-word blocks are valid jump targets. Hence, I can never jump into the middle of an instruction. (Alternative proposal: Decoding of instruction lengths is always relative to the most recent 8-word boundary. Entry into the middle of an instruction is a fault. Worst case price: The decoder must look back 7 words on a jump and decode the instruction lengths in order to determine whether the jump was faulty. This does not require a lot of extra instruction fetching because these 7 previous words are in the same cache-line anyway).

Worst case overhead for instruction alignment restriction: 25% is NOP. Worst case for jump target restriction: Harder to compute, but compilers should be able to manage without too many NOPs.

Gain: Decoding is ridiculously parallel, because I can decode every 8-word block independently. If decoding becomes the bottleneck, I just plug in more decoding units (i.e. we might still be limited by decoding latency, but never by decoding throughput). The longest possible dependency chain in decoding is 7. Your proposal has infinite dependency chains in decoding (you must know the reading frame).

I can statically analyze code coming in 8 word-blocks and be sure to never have mis-decoded an instruction.

Another bag of problems we can avoid: Suppose some program uses code in two valid decodings, dependent on entrypoint. This means that your branch predictor, uop-cache, etc must work on the tuple (memory location, reading frame), where reading frame is one of 0,1,2. With my proposal you cut this down to just memory location, which simplifies the logic a lot-- because the logic for caching uops or other processor generated metadata is the same logic as for caching data and just maps address -> stuff.

The only real-life code that does such stuff (e.g. uses a 2-word instruction as a different instruction by jumping into the middle) is shellcode. While shellcode is fun and all, this is maybe not the most important customer group. You target "real" compilers, and not return oriented programming compilers.

Author: Agner Date: 2017-09-21 12:54
yeengief wrote:

Yes, but decoding is still sequential and not parallel.

My plan is indeed that decoding in parrallel should be possible. Assume that you are loading 32 bytes (= 8 words) of code in one clock cycle. Each 4-byte word has two bits that may determine instruction length. This gives 16 bits in all. You need a combinational logic circuit with 16 inputs to determine where all the instruction boundaries are. This can easily be done in one clock cycle. There is no reason to decode too far ahead because mispredicted branches may spoil the advantage. I think it is better to spend your resources on decoding multiple possible branches simultaneously when facing a branch with poor prediction. And, as I said, a micro-op cache is not needed.

Author: yeengief Date: 2017-09-21 14:47
>My plan is indeed that decoding in parallel should be possible. Assume that you are loading 32 bytes (= 8 words) of code in one clock cycle. Each 4-byte word has two bits that may determine instruction length. This gives 16 bits in all. You need a combinational logic circuit with 16 inputs to determine where all the instruction boundaries are. This can easily be done in one clock cycle. There is no reason to decode too far ahead because mispredicted branches may spoil the advantage. I think it is better to spend your resources on decoding multiple possible branches simultaneously when facing a branch with poor prediction. And, as I said, a micro-op cache is not needed.

If you really, really don't want this,.... But I am telling you that you will regret this choice the next time you need to software-decode gigabytes of data (e.g. malware sample database, memory snapshots for forensics) or when you want to write a QEMU module for emulating your processor, or when you want to implement a sandbox for untrusted code (see NaCL, web-assembly, etc), where you want a whitelist of acceptable instructions (in order to protect from unknown kernel/driver bugs). And every hacker will tell you that the ambiguity of instruction decodings is absolutely bonkers (hey, I overflowed over a function pointer but have no w&x page.. let's jump into the middle of some benign instruction to give an entirely different meaning to the rest of the code).

In fact: How would you software decode a big binary blob and check whether it contains a forbidden instruction?

Sequentially, and letting all but one core idle?

Speculatively and parallel, throwing away 2/3 of the work?

Ah, this does not really work: You cannot throw away 2/3 of the misaligned work because you wanted to reach all possible decodings that the processor can see. Hence, your data has grown by a factor of three during "complete" decoding, and you will need to use complex and faulty heuristics to remove unneeded stuff (oh, if I jump here then I will hit a invalid instruction down the road, by symbolic execution. Hence, this probably was not the right reading frame.... Hah! it was an undocumented instruction and you just missed the crucial part of the code!).

This is not hypothetical. This hell is reality, today, behold IDA. Please don't make more of it.

And, as I said, a micro-op cache is not needed.

It is needed if I want to run your binary in QEMU on a different architecture. The very first users of a new architecture are people who want to code/debug on an emulated version, long before the first chip is produced.

And who knows whether a hardware micro-op cache will be needed in the future? By allowing jumps into the middle of instructions you will force everyone in the future to support such legacy behaviour which is effectively good for shellcode only.

Author: Agner Date: 2017-09-23 00:43
Enforcing instruction alignment at cache boundaries doesn't help much, because malicious code can jump over the boundaries. You would have to enforce alignment of all jump targets as well, which means a lot of wasteful NOPs in the code.

Another possible solution is to use the same principle as in UTF-8 code: single-byte UTF-8 codes and continuation bytes can be distinguished by the first bit. A similar principle could be implemented in ForwardCom. The first word of a multi-word instruction has the most significant bit = 1. The last word of a multi-word instruction has the most significant bit = 0. A single word instruction has the most significant bit = 0. This will make it impossible to execute stealth code by misaligning an entry. To implement this feature we need an extra bit in the first word of each multi-word instruction to replace the bit we have used in the last word. If the last word contains a 32-bit payload such as a 32-bit address or a floating point number, then all binary tools will need to have the extra complication of inserting the missing bit. This is a clumsy solution, but possible. The first word of multi-word instructions has no unused bits so we will have to sacrifice some other feature if we want to make space for this extra bit.

We have to make a cost benefit analysis to see if this feature is worth the costs. The costs of making a bit to distinguish start words from end words are: fewer bits for instruction features, clumsy and complicated binary tools, and slower emulation. The advantage is easier analysis of potentially malicious code.

I wonder it the kind of analysis you are talking about actually is an efficient weapon against malicious code. It is not easy to jump to an unauthorized address on ForwardCom without detection. All jump tables and function pointers are in read-only memory, and the call stack is separate from the data stack and isolated from everything else. The only way a malicious code could jump into an unrecognized address is by jumping to a calculated address and make this calculation difficult to trace. A scanner for malicious code should look for any jump to a calculated address and do a more thorough analysis only if such a jump occurs. This extra analysis wouldn't be so time consuming after all because disassembling is fast.

I would be more concerned about malicious code calling a system function with a calculated ID. Running in a sandbox, the code would call a benign system function, while under certain conditions or on a certain date, the malicious code would call a different system function with different parameters. An intelligent analysis tool could detect that the code uses a calculated system function ID and calculated parameters, but it may be impossible for even the best analysis tool to detect what the code can do if the calculation of system function ID uses cryptographic techniques. We cannot prevent this kind of attack by code alignment techniques and analysis of binary code. I think the best prevention against threats like these is to make restrictions on the access rights of each executable file. I have proposed the following security principle for ForwardCom: An executable file is not allowed to run unless it has been installed by a standardized installation procedure. This procedure includes the assignment of access right to the executable file, such as whether the file is allowed to access various resources.

Author: - Date: 2017-09-22 22:05

Btw: self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

I'm not sure about strictly enforcing W^X. I get that it really messes with processor optimizations and all, but applications such as emulators which make heavy use of dynamic recompilation would suffer if they are forced to syscall every time they modify code. https://news.ycombinator.com/item?id=11789169
I think an instruction to explicitly flush the instruction cache could be a good compromise.

I haven't looked at this architecture yet, though I do wonder how cache flushing instructions could mess with security issues like cacheline based attacks.

Author: Agner Date: 2017-09-23 01:19
- wrote:

self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).

I am not too fond of the trend of JIt compilation and byte code. If you are JIT compiling each part of a program separately, you get erratic and unpredictable response times which users hate. I would prefer to compile everything once. ForwardCom is designed to be forward compatible in the sense that a compiled program can run optimally without recompilation on future microprocessors with longer vector registers. This means less need for JIT compiling. Another planned feature is re-linking of executable files. Instead of calling a DLL or shared object, you use static linking to include the necessary library functions in the executable, with the possibility to re-link later if a library function needs replacement. This makes the code run faster than if dynamic linking is used, and it makes it possible to adapt the code to different environments by replacing e.g. a user interface library.

I'm not sure about strictly enforcing W^X. I get that it really messes with processor optimizations and all, but applications such as emulators which make heavy use of dynamic recompilation would suffer if they are forced to syscall every time they modify code

JIT compilation is also a security problem. If we allow self-modifying code or code that makes code and executes it immediately, then we have all kind of security problems. I would like to re-think the whole idea of JIT compilation.

Emulation is a problem, I agree, if we don't allow dynamic compilation. An emulator running under ForwardCom and emulating something else would need special permission to do dynamic compilation and running the code it generates. The user will be warned when installing the emulator that it needs dangerous special permissions.

Author: Hubert Lamontagne Date: 2017-09-25 16:43
yeengief wrote:

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there. Or maybe provide a separate magic memory location for this, if the stack would clash too much with calling conventions.

Desired effect: The compiler/programmer is limited by the number of addressable registers. Often, you will need to temporarily store stuff you need later. In reality, more registers exist. Allowing the processor to avoid hitting the L1-cache and using a renamed/anonymous register should give huge performance boosts in future processors with more anonymous registers running legacy code.

At the same time, this makes dependency analysis (in software and hardware) easier.

That's an interesting proposal, though I think you'd probably end up needing 2 different stacks. The C++ stack is a mosaic of data that the compiler knows that the user "can't" access (like local values of previous functions that are never accessed through a pointer, register saving, return addresses), and data that the compiler knows is "pointer accessible" (values that the code gets the address of). "Pointer accessible" data needs to be readable across threads, too (it's totally legal to have another thread operate on data on your stack, as long as you can guarantee that your function won't exit before the other thread is done). "Pointer accessible" data can't be renamed to a register - it basically has to come out of the same accesses as other memory read/writes and must preserve order and page-faulting behavior (which basically means it has to be in L1 data cache). But the "non-pointer accessible" stack could probably have its own separate memory space, which would make it possible to give it its own separate cache and read/write to it in parallel, and even do register-renaming on it as you propose.

Author: Agner Date: 2017-09-26 00:05
Hubert Lamontagne wrote:

yeengief wrote:
Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

I am not sure how this differs from my proposal: Each thread has two stacks. The call stack is private and separate from everything else (and usually so small that it can be cached entirely on-chip). The data stack is private to the thread that owns it. This requires some discipline from the programmer. Only static memory or memory allocated by the main thread is accessible to all threads. Or you may have a special malloc_shared function for allocating memory that is accessible to all threads. Semafors and mutexes are placed in shared memory.

The private data stack is used for function-local variables and register spilling. Since it is private, it does not need coherence across threads. This makes caching of local data simple and efficient. It is also good for safety, of course.

Legacy software needs to turn off the private stack feature if it is sharing local memory between threads. A flag in the executable file header could control this feature.

Author: Hubert Lamontagne Date: 2017-09-26 01:37
Agner wrote:

Hubert Lamontagne wrote:

yeengief wrote:

Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there.

I am not sure how this differs from my proposal: Each thread has two stacks. The call stack is private and separate from everything else (and usually so small that it can be cached entirely on-chip). The data stack is private to the thread that owns it. This requires some discipline from the programmer. Only static memory or memory allocated by the main thread is accessible to all threads. Or you may have a special malloc_shared function for allocating memory that is accessible to all threads. Semafors and mutexes are placed in shared memory.

The private data stack is used for function-local variables and register spilling. Since it is private, it does not need coherence across threads. This makes caching of local data simple and efficient. It is also good for safety, of course.

Legacy software needs to turn off the private stack feature if it is sharing local memory between threads. A flag in the executable file header could control this feature.

The difference is that I'm seeing it from the C++ compiler point of view, and how the current ecosystem of malloc+mutexes+threads works in complex applications. As it is, program values that you don't take the address of are indeed truly local, and will show up as renamable values in the SSA pass and not translate into GEP+Load+Store instructions, and can safely be moved into a non-addressable memory like a special stack (or, obviously, a register). This class of values includes register spilling, argument passing, and function return addresses - none of these are "really" visible to the programmer (except for the value), so they can also be moved to a private stack - even in existing C++ code! This doesn't even require special discipline. You can do really crazy optimizations on this "value-only" data, like reordering the reads/writes, assuming perfect alignment and no partial value writes, making page faults and reads/writes from other threads impossible, allowing register renaming and so forth, because you're protected from all the crazy cases since the user doesn't have the address and can't pass it around halfway across the program.

As soon as you take a pointer or reference, it automatically becomes visible to anything that you can pass that pointer to, which turns out to be "the whole rest of the process including other threads". This forces you to give it a real memory address and do real Loads/Stores that run in program order and are visible across all threads. And it's only after that, that you can put in your optimizer pass that tries to remove the Loads/Stores if it can prove that it doesn't change anything in program behavior - which turns out, is really hard. Even just reordering Loads/Stores is hard, because it's hard to prove that function pointers will never overlap and rule out even really dumb cases like input and output buffers to a function overlapping out of alignment.

The difference is that my proposal is based on pointer sharability:
- The main data stack for each thread is NOT thread local and is coherent across the whole process, other threads can read/write in its memory range. It is used for pointer-accessible local values.
- The call stack is local and can also contain anything the compiler KNOWS is non-pointer-accessible such as function call register saves/restores and spillovers and local arrays, and is technically isn't even in paged memory, and isn't accessed with normal load/store instructions.
- Memory allocated using malloc() is accessible from all threads (even if it's allocated by a worker thread). This is necessary for tons of existing C++ code. You can provide a private_malloc() function.

Post by **agner** » 2017-11-02, 14:30:24

Author: Agner Date: 2017-04-27 00:03
I am working on the binary utilities for ForwardCom, and I have got an idea of a new concept for function libraries.

The well-known method for static linking is used. An option to the linker specifies that certain object files or library files should be re-linkable. The necessary symbol and relocation records are then preserved in the executable file so that the library functions can later be replaced by repeating the link process.

This concept has many advantages:

The executable file is ready to run without any external dependencies. You will never have problems with a missing DLL or a wrong version of a shared object.
An important advantage of static linking is that you don't have to load a large dynamic library containing hundreds of functions when you only need a few of these functions. The re-linking feature combines this advantage with the ability to update or replace a library later or to use a third-party library, which traditionally requires dynamic linking.
If you update a DLL in Windows, you are affecting all programs that use this DLL, even if some programs work better with the old version. With re-linkable libraries, you can update a library in each program separately.
You can update or modify a program without access to the original source code.
The rarely used symbol interposition feature in shared objects is very inefficient because all accesses to public symbols go through global offset tables (GOT) and procedure linkage tables (PLT). These inefficient table lookups are not needed with the relink method
You can have different versions of your functions for different platforms. For example, the same program can have different graphical user interfaces for different environments. The appropriate version can conveniently be linked to the executable program as part of the installation procedure.
You can have specific versions of a critical piece of code optimized for different microprocessor models.
You can reconfigure a program at any time if you update the hardware or operating system.
You are not wasting time on loading and linking dynamic libraries every time a program is loaded.
Modifying or updating an executable program requires administrator privileges, for security reasons.
While this concept is intended for ForwardCom, it can also be implemented in existing systems. The only thing this requires is a new linker. Run-time loading of this kind of library is only possible in ForwardCom, though.

I have previously proposed a feature to link in libraries when a program is loaded, where the choice of library can depend on environment variables, or whatever. Do we still need this feature if we have relinking at install time? I think it is safer to relink when a program is installed than when it is loaded, because a hacker might insert a tampered-with library to be linked at load time.
If relinking requires administrator privileges then it is more convenient to do it when the program is installed. In the rare cases where a choice between different versions needs to be done at load time, it would be safer (and faster) to store different ready-made versions of the executable file.

Any comments?

Author: Hubert Lamontagne Date: 2017-04-27 17:18
Interesting. You'd still probably need some kind of dynamic linking for plugins (since a same set of interface functions can lead to multiple different plugins loaded at the same time... maybe you could do it with some kind of OS-supplied dispatcher), but it would probably cut the usage of dynamic linking a lot, yes.

Author: Agner Date: 2017-04-27 23:13
Hubert Lamontagne wrote:

You'd still probably need some kind of dynamic linking for plugins.

You are right. A plugin could in principle be statically linked into the executable, but you would probably prefer third party plugins to be loaded at runtime.

Maybe we need to mark functions and data in the executable as public or private, where only public symbols can be accessed by plugins. But a plugin, whether statically or dynamically linked, is still able to access everything that it can find the address of. This might be a security problem. Even if there is no malicious intent, a poorly behaved plugin that messes with things it shouldn't touch might cause compatibility problems.

This is also a problem in existing systems, of course, but I would like ForwardCom to be safer. Maybe we need a system where each plugin is running as a separate thread or process where we can control its access rights?

Author: Hubert Lamontagne Date: 2017-05-10 16:43
Agner wrote:

Maybe we need a system where each plugin is running as a separate thread or process where we can control its access rights?

I think that class of problem tend to be already solvable using IPC (inter process communication)... And it wouldn't address some of the use cases I've seen:

- Java-C++ interfacing is done with DLLs on windows, and presumably programs do tons of small function calls back and forth, and do stuff like sharing large arrays in memory. The first generation of Android apps have a ton of this stuff going on (to interface between C++ code and Android's Java APIs), and getting the thread-to-thread overhead every time you read a button or something like that would've been a problem.
- Some script languages also use dynamic linking to interface with C++ (like, I think Python does this), and they will do tons and tons of small function calls.
- Audio processing plugins need to run in the same thread as the caller because they use tiny buffers (like, 1ms tiny - which means that the whole processing of a block of audio must take less than 0.001s before sending to the audio hardware every single time). If the OS's scheduler gets involved in any way you'll have dropouts for sure.
- Some game engines allow for plugins to extend features, which might require sharing access to very large data structures like level data in memory.

In a way, this reminds me of the whole micro-kernel thing, which had the same issue but between the applications and system modules (applications don't want to wait after the scheduler to call system module functions!).

Author: Agner Date: 2017-05-11 01:07
Good examples, Hubert.

I think the optimal solution for Java, C#, and script languages is just-in-time compiling. Actually, I don't understand why there are so many slow applications out there that don't use JIT compiling

The code only has to be compiled the first time it is run. Or - why not distribute it in compiled form for ForwardCom? The compiled code is guaranteed to be forward compatible with future ForwardCom processors.

The compiled code can easily be linked to C/C++ code in the JIT process. The calling conventions of ForwardCom are standardized to work across different programming languages.

You are probably right that interprocess communication is a good solution for plug ins if you don't want the plug in to have access to mess with everything in the mother program.

Author: Hubert Lamontagne Date: 2017-05-12 17:34
Agner wrote:

Good examples, Hubert.

I think the optimal solution for Java, C#, and script languages is just-in-time compiling. Actually, I don't understand why there are so many slow applications out there that don't use JIT compiling

The code only has to be compiled the first time it is run. Or - why not distribute it in compiled form for ForwardCom? The compiled code is guaranteed to be forward compatible with future ForwardCom processors.

The compiled code can easily be linked to C/C++ code in the JIT process. The calling conventions of ForwardCom are standardized to work across different programming languages.

You are probably right that interprocess communication is a good solution for plug ins if you don't want the plug in to have access to mess with everything in the mother program.

To load in C++ objects, the Java code calls System.loadLibrary("name"), which on windows loads "name.dll" and on posix loads "libname.so", so you still need either dynamic linking or arbitrary on-the-fly recompilation phases on already running apps, because your Java code is already running when you learn that you have to load this library. Another similar mechanism in Java is ClassLoader, which dynamically loads in more Java code into an already running app from arbitrary jar files on the disk (it's basically the Java equivalent of loading DLLs).

Author: Agner Date: 2017-05-13 00:58
Hubert Lamontagne wrote:

To load in C++ objects, the Java code calls System.loadLibrary("name")

ForwardCom also supports runtime dynamic linking. A highly optimizing JIT compiler might replace this by static linking.

forwardcom forum

The original proposal, 2 years ago

Intel's Control-flow Enforcement Technology

Assembler with metaprogramming features

Number of register file ports in implementations

Proposal for instruction set - now on Github

New idea: Re-linkable libraries in ForwardCom