Different instruction sets on different cores

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

JoeDuarte
Posts: 10
Joined: Tue Dec 19, 2017 6:51 pm

Post by JoeDuarte » Tue Dec 19, 2017 7:00 pm

Hi Agner – Do we need every core to support the same registers and instructions? There is some evidence that a logarithmic number system would be more efficient than floating point for many workloads (https://en.wikipedia.org/wiki/Logarithmic_number_system). It would be nice to have floating point on two cores, and logarithmic on two other cores, for example.

And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.

Separately, how would ForwardCom fare with strings compared to SSE 4.2? I don't see any comparable instructions.

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am

Post by agner » Wed Dec 20, 2017 5:55 pm

A logarithmic number system is efficient as long as you are using it for multiplication only, but difficult if you want to do addition. You need no extra hardware for multiplying logarithmic numbers - this is simply addition of integers. Another possibility is to use standard floating point numbers and add the exponents. ForwardCom has an instruction mul_2pow that adds an integer n to the exponent of a floating point number. This corresponds to multiplying by 2^n, or dividing if n is negative. This does floating point multiplication at the speed of integer addition.
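As a rough illustration of the exponent trick described above, the Python sketch below adds n to the biased exponent field of an IEEE 754 double. The name `mul_2pow_sketch` is made up for this example and is not the ForwardCom instruction itself; the sketch assumes a normal, finite input and simply rejects anything that would overflow or underflow.

```python
import struct

EXPONENT_MASK = 0x7FF   # 11-bit exponent field of an IEEE 754 double
EXPONENT_SHIFT = 52     # the exponent sits above the 52 fraction bits


def mul_2pow_sketch(x: float, n: int) -> float:
    """Multiply x by 2**n using only integer arithmetic on the exponent field.

    Hypothetical helper for illustration; handles normal, finite doubles only.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> EXPONENT_SHIFT) & EXPONENT_MASK
    if exponent == 0 or exponent == EXPONENT_MASK:
        raise ValueError("zero, subnormal, infinity and NaN are not handled here")
    new_exponent = exponent + n  # the whole "multiplication" is this integer add
    if not 0 < new_exponent < EXPONENT_MASK:
        raise OverflowError("result would overflow or underflow")
    bits = (bits & ~(EXPONENT_MASK << EXPONENT_SHIFT)) | (new_exponent << EXPONENT_SHIFT)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Sign and fraction bits are left untouched, which is why no floating point multiplier is needed for this special case.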

I have not implemented something like Intel's SSE4.2 instructions for the following reasons:
  • These instructions are used mainly for manipulating human-readable text. Such texts are usually so short that execution time is negligible. Only applications such as DNA analysis are critical.
  • I don't want complicated instructions that need to be split up into micro-operations. This makes the whole pipeline more complicated and slower.
  • SSE4.2 is rarely used because it doesn't easily integrate into high level programming languages.
  • You can have an FPGA for application-specific instructions. This can be used for SSE4.2-like operations, cryptographic instructions, etc.

-.-
Posts: 4
Joined: Sun Dec 24, 2017 5:10 am

Post by -.- » Sun Dec 24, 2017 5:28 am

JoeDuarte wrote:
Tue Dec 19, 2017 7:00 pm
And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.
I would've thought that a very common application of crypto acceleration would be a multi-threaded HTTPS/VPN/etc server, where the acceleration units would need to be on each core to be used. You could just lock the server to one core, but then you'll be unable to use the other cores on the chip. Alternatively, you could have a process/thread running on the "crypto core" and pass data back and forth between the server's worker threads and the crypto thread, but that'd complicate the programming model a little (not too sure how much of a performance penalty this is) - still, it'd work I suppose.
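The "crypto core" hand-off described above could look roughly like the following Python sketch: worker threads post requests to a queue, and a single dedicated thread does all the cipher work. Everything here is an assumption for illustration: `placeholder_cipher` is a stand-in XOR and not real encryption, the function names are invented, and the sketch does not pin any thread to a particular core.

```python
import queue
import threading


def placeholder_cipher(data: bytes) -> bytes:
    # Stand-in for the accelerated operation; NOT real encryption.
    return bytes(b ^ 0x5A for b in data)


def crypto_thread(requests: queue.Queue) -> None:
    # Runs on the one thread that would live on the "crypto core".
    while True:
        item = requests.get()
        if item is None:          # sentinel: shut down
            break
        payload, reply = item
        reply.put(placeholder_cipher(payload))


def worker(requests: queue.Queue, payload: bytes, results: list, index: int) -> None:
    # A server worker: hand the data over, then block until the result returns.
    reply = queue.Queue(maxsize=1)
    requests.put((payload, reply))
    results[index] = reply.get()


def run_demo(payloads: list) -> list:
    requests = queue.Queue()
    crypto = threading.Thread(target=crypto_thread, args=(requests,))
    crypto.start()
    results = [None] * len(payloads)
    workers = [
        threading.Thread(target=worker, args=(requests, p, results, i))
        for i, p in enumerate(payloads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    requests.put(None)
    crypto.join()
    return results
```

Each request makes two queue hops, and the single consumer serializes all crypto work, which is where the performance penalty mentioned above would come from.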

It's interesting to note that Intel has announced the AVX-512 VAES extension for upcoming Ice Lake processors, which can encrypt 4 streams in parallel. I don't know what purpose this is aimed at, but clearly they see a benefit in enabling more parallel encryption (or maybe it helps accelerate single-stream AES-CTR, though its release alongside VPCLMULQDQ seems to suggest that 4 parallel AES-GCM streams are the aim).

I've never done any work with FPGAs so cannot comment on how it'd compare with a "dedicated" crypto core.

Kulasko
Posts: 3
Joined: Tue Nov 14, 2017 9:41 pm
Location: Germany

Post by Kulasko » Sun Jan 14, 2018 6:28 am

-.- wrote:
Sun Dec 24, 2017 5:28 am
It's interesting to note that Intel has announced the AVX-512 VAES extension for upcoming Ice Lake processors, which can encrypt 4 streams in parallel. I don't know what purpose this is aimed at, but clearly they see a benefit in enabling more parallel encryption (or maybe it helps accelerate single-stream AES-CTR, though its release alongside VPCLMULQDQ seems to suggest that 4 parallel AES-GCM streams are the aim).

I've never done any work with FPGAs so cannot comment on how it'd compare with a "dedicated" crypto core.
FPGA implementations have a few drawbacks compared to ASIC implementations, the most notable perhaps being the attainable clock rate of a given block of logic (typically a few hundred MHz today), so you will see higher latency. However, the ForwardCom ISA should cover the vast majority of latency-sensitive algorithms, as it describes a general-purpose processor. For throughput-sensitive algorithms, you can usually just increase parallelism. In theory, you can design a wider FPGA implementation with a higher total throughput than a narrower ASIC implementation.

A current idea for ForwardCom is to integrate FPGAs in CPU cores; the current specification version has reserved instruction codes for this purpose. It should be possible to supply a library of FPGA designs (provided by the operating system?) and then use these designs as one would use regular instruction extensions in other architectures. Of course, the supplied algorithm has to exploit enough parallelism, and the program has to tell the operating system which algorithm it wants to run.

-.-
Posts: 4
Joined: Sun Dec 24, 2017 5:10 am

Post by -.- » Fri Jan 19, 2018 10:06 am

I'd imagine that mostly serial encryption, such as AES-CBC, would suffer, speed-wise, on an FPGA compared to a CPU with dedicated AES instructions, though mostly parallel methods like AES-CTR could be better (for large enough amounts of data).
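To make the serial versus parallel distinction concrete, here is a small Python sketch of the two data-dependency patterns. It is only a structural illustration under loud assumptions: `toy_block_function` is a truncated SHA-256 standing in for the block cipher (it is not AES and not invertible), and all names are invented for this example.

```python
import hashlib

BLOCK = 16  # block size in bytes


def toy_block_function(key: bytes, block: bytes) -> bytes:
    # Stand-in for a block cipher call; NOT AES and not invertible.
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_like(key: bytes, iv: bytes, blocks: list) -> list:
    # Chained: block i cannot start until the output of block i-1 is known.
    out, prev = [], iv
    for block in blocks:
        prev = toy_block_function(key, xor(block, prev))
        out.append(prev)
    return out


def ctr_like_block(key: bytes, nonce: bytes, index: int, block: bytes) -> bytes:
    # Depends only on the counter value, never on any other block.
    keystream = toy_block_function(key, nonce + index.to_bytes(8, "big"))
    return xor(block, keystream)


def ctr_like(key: bytes, nonce: bytes, blocks: list) -> list:
    # Every iteration is independent, so the blocks could be processed
    # in any order or all at once on wide parallel hardware.
    return [ctr_like_block(key, nonce, i, b) for i, b in enumerate(blocks)]
```

In the chained loop, a slow clock directly stretches the critical path, while the counter-mode loop can hide per-block latency behind width, provided there is enough data to fill it.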

I haven't really looked at what ForwardCom provides though, so maybe it has other mitigations in place.

JoeDuarte
Posts: 10
Joined: Tue Dec 19, 2017 6:51 pm

Post by JoeDuarte » Mon Jan 22, 2018 2:36 am

-.- wrote:
Sun Dec 24, 2017 5:28 am
JoeDuarte wrote:
Tue Dec 19, 2017 7:00 pm
And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.
I would've thought that a very common application of crypto acceleration would be a multi-threaded HTTPS/VPN/etc server, where the acceleration units would need to be on each core to be used. You could just lock the server to one core, but then you'll be unable to use the other cores on the chip. Alternatively, you could have a process/thread running on the "crypto core" and pass data back and forth between the server's worker threads and the crypto thread, but that'd complicate the programming model a little (not too sure how much of a performance penalty this is) - still, it'd work I suppose.
I've never done any work with FPGAs so cannot comment on how it'd compare with a "dedicated" crypto core.
You're right about the server use case. I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.

I think it's suboptimal to have the same ISA for servers and clients, and to have such a vast number of instructions supported by every core. I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core. I don't think having a different ISA for servers vs. clients would be much trouble for developers, since developing applications for mobile and desktop is already quite different from server development and most developers don't interact with the ISA directly. 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices.

Agner seems to think that we can dish a bunch of work to FPGAs, like encryption. But I think it's very unlikely that OEMs will want to include FPGAs in most devices. Maybe he's just thinking of servers, which is more feasible, but even then I don't know that FPGAs will ever be common. An FPGA is going to add to the BOM and cost, and I doubt many customers will be clamoring for them. Very few developers have any experience with FPGAs, and they seem to be niche devices for things like high frequency trading. It's hard enough to get developers to use modern CPU instructions, vectorization, etc.

JoeDuarte
Posts: 10
Joined: Tue Dec 19, 2017 6:51 pm

Post by JoeDuarte » Mon Jan 22, 2018 2:43 am

agner wrote:
Wed Dec 20, 2017 5:55 pm
A logarithmic number system is efficient as long as you are using it for multiplication only, but difficult if you want to do addition. You need no extra hardware for multiplying logarithmic numbers - this is simply addition of integers. Another possibility is to use standard floating point numbers and add the exponents. ForwardCom has an instruction mul_2pow that adds an integer n to the exponent of a floating point number. This corresponds to multiplying by 2^n, or dividing if n is negative. This does floating point multiplication at the speed of integer addition.
Agner, I got the impression that a logarithmic number system benefits greatly from a hardware implementation, like the European Logarithmic Microprocessor. For example:

https://www.ece.ucsb.edu/~parhami/pubs_folder/parh13-asilo-log-arith-as-alt-to-flp.pdf

http://ieeexplore.ieee.org/document/7154603/?reload=true

If it's just integer addition, what are these hardware implementations implementing?

agner
Site Admin
Posts: 58
Joined: Sun Oct 15, 2017 8:07 am

Post by agner » Mon Jan 22, 2018 5:53 am

Joe,
In a logarithmic number system, multiplication and division become simpler, but addition and subtraction become much more complicated. Your links confirm this. A program with an equal number of additions and multiplications will be faster on a floating point computer than on a logarithmic processor. Precision is also an issue. An integer can be expressed exactly in a floating point system (as long as it fits in the mantissa), but generally not in a logarithmic system.
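A minimal Python sketch of the asymmetry, assuming positive values stored as a fixed-point base-2 logarithm: multiplication is one integer addition, while addition needs the nonlinear function log2(1 + 2^d), which is the part that dedicated hardware typically approximates with tables or interpolation. `FRAC_BITS` and all function names are assumptions made up for this example.

```python
import math

FRAC_BITS = 16            # hypothetical number of fractional bits in the stored log
SCALE = 1 << FRAC_BITS


def lns_encode(x: float) -> int:
    # Positive values only in this sketch; a real format also needs a sign and zero.
    return round(math.log2(x) * SCALE)


def lns_decode(e: int) -> float:
    return 2.0 ** (e / SCALE)


def lns_mul(a: int, b: int) -> int:
    # log2(x * y) = log2(x) + log2(y): a plain integer add.
    return a + b


def lns_add(a: int, b: int) -> int:
    # log2(x + y) = log2(x) + log2(1 + 2**(log2(y) - log2(x))) for x >= y.
    hi, lo = max(a, b), min(a, b)
    d = (lo - hi) / SCALE
    return hi + round(math.log2(1.0 + 2.0 ** d) * SCALE)
```

With this encoding, 4 round-trips exactly because it is a power of 2, but 3 does not, which is the precision issue mentioned above.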

Kulasko
Posts: 3
Joined: Tue Nov 14, 2017 9:41 pm
Location: Germany

Post by Kulasko » Tue Jan 23, 2018 7:40 pm

JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
You're right about the server use case. I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.

I think it's suboptimal to have the same ISA for servers and clients, and to have such a vast number of instructions supported by every core. I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core. I don't think having a different ISA for servers vs. clients would be much trouble for developers, since developing applications for mobile and desktop is already quite different from server development and most developers don't interact with the ISA directly. 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices.
What you are aiming at is basically a client-specialized architecture. That might be optimal from a pure space/power efficiency point of view. However, it would need twice the development work (OS, compilers etc. need to be developed too), it would be binary incompatible with all other kinds of devices, and in the case of your 40-bit proposal, it could introduce nasty bugs, as developers are accustomed to 32- or 64-bit integers and floating point numbers. Also, it might run into the same addressing wall we hit with 32 bits in the early 2000s if it survives for a decade or more.
An ISA is a mere specification; you can vary a lot of things through the implementation. For example, you could build a ForwardCom processor with 128-bit vectors for clients, one with 256-bit vectors for servers, and one with 8192-bit vectors for scientific computing. In that regard, ForwardCom allows very high flexibility. Also, there is no need to implement all instructions efficiently if they are rarely used in the environment you design your processor for. A good part of the more advanced instructions in ForwardCom are even fully optional.
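The idea that one program can run on implementations with different vector widths can be sketched as a length-agnostic loop. This is plain Python and only models the concept: `max_vector_elements` is a hypothetical parameter standing in for whatever the hardware reports, and the sketch does not reproduce ForwardCom's actual loop instructions.

```python
def vector_add(a: list, b: list, max_vector_elements: int) -> list:
    """Add two equal-length arrays in chunks no larger than the hardware vector."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = []
    i = 0
    while i < len(a):
        # Use a full vector when possible, a shorter one for the tail.
        n = min(max_vector_elements, len(a) - i)
        # One "vector instruction" worth of work.
        result.extend(x + y for x, y in zip(a[i:i + n], b[i:i + n]))
        i += n
    return result
```

The same code yields the same result whether a vector holds 4 elements or 256; only the number of loop iterations changes.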
JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
Agner seems to think that we can dish a bunch of work to FPGAs, like encryption. But I think it's very unlikely that OEMs will want to include FPGAs in most devices. Maybe he's just thinking of servers, which is more feasible, but even then I don't know that FPGAs will ever be common. An FPGA is going to add to the BOM and cost, and I doubt many customers will be clamoring for them. Very few developers have any experience with FPGAs, and they seem to be niche devices for things like high frequency trading. It's hard enough to get developers to use modern CPU instructions, vectorization, etc.
The current ForwardCom proposal integrates an FPGA in every CPU core. The OS could supply FPGA programs for different instruction extensions, so a programmer could use them as if they were a native part of the ISA. However, the speed disadvantage versus a native implementation will remain and might be a critical problem in edge cases. In those cases, an ISA extension might be unavoidable.

-.-
Posts: 4
Joined: Sun Dec 24, 2017 5:10 am

Post by -.- » Wed Jan 31, 2018 12:48 am

JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.
True, most clients won't need that much crypto. It could increase a bit in the future (DRM, HTTPS, encrypted communications, I/O etc.), though it's still likely not that much. Disks are usually slow enough that encrypting them doesn't have much of an impact on the CPU, but with faster storage becoming readily available, this can change too. Modern SSDs often have built-in encryption (self-encrypting drives or SEDs), but there can be trust issues with using those (e.g. often insecurely implemented by the manufacturer).
JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
I think it's suboptimal to have the same ISA for servers and clients
It is indeed better to target your chips at the applications running on them, but I agree with Kulasko that there's also a cost to having different ISAs between client/server. There's a reason why the overwhelming majority of servers run x86...
JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core
I don't really write programs which use much FP math, but off the top of my head, I'd imagine that games would make heavy use of FP, along with some media content creation applications, and possibly even web page rendering. A number of scripting languages, such as JavaScript, exclusively use 64-bit floats for their number representation (though JIT engines may be able to optimise these into ints).
JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
since developing applications for mobile and desktop is already quite different from server development
Funnily enough, Node.js is really hot in the server-side web application development space at the moment. One of its key selling points is that it uses the same language as the browser, so libraries can be shared across the two (though web devs often change their technology stack every few years, so this may not last).
Unrelated to ISA details, but I just felt like pointing it out anyway. I do agree that development work for clients and servers is generally quite different, not to mention that desktops/mobiles often have an x86/ARM split.
JoeDuarte wrote:
Mon Jan 22, 2018 2:36 am
64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices
40-bit does sound interesting, and I can't really disagree with not needing any more, but I think in this day and age, not having 64-bit support may be a hard sell. A bunch of code and algorithms (e.g. off the top of my head, SHA512) are already optimised to make use of 64-bit CPUs.
1TB of addressable memory does sound like a potential limitation though. It's not unusual for servers to have this much memory these days, and it's likely it'll become common in clients in the future. Also, if non-volatile memory storage solutions become popular in the future, and OSes see a benefit in mapping disk into RAM, 1TB would definitely be limiting.
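For reference, the arithmetic behind the 1TB figure can be checked with a few lines of Python. The helper names are invented for this example, and sizes are reported in binary units, so 40 bits comes out as 1 TiB.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def addressable_bytes(address_bits: int) -> int:
    # A byte-addressed space with n address bits covers 2**n bytes.
    return 1 << address_bits


def format_binary_size(n: int) -> str:
    # Exact for powers of two, which is all this sketch is used for.
    index = 0
    while n >= 1024 and index < len(UNITS) - 1:
        n //= 1024
        index += 1
    return f"{n} {UNITS[index]}"


def describe(address_bits: int) -> str:
    return f"{address_bits}-bit: {format_binary_size(addressable_bytes(address_bits))}"
```

Calling `describe` for 32, 40 and 64 bits gives 4 GiB, 1 TiB and 16 EiB respectively, so a 40-bit space sits only 256 times above the 32-bit limit.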
