Different instruction sets on different cores

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

Post Reply
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

JoeDuarte wrote:I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core.
Video games use TONS of floating point math. Everything that is a 3d coordinate is going to have 32bit floating point XYZ. If your scene has 10 million vertexes, that's 30 million floating point values just for the coordinates (plus another 30 million for the normals), that need to be processed each frame (= 60 times per second). The whole point of the PS2's "Emotion engine" (customized MIPS + 2 high bandwidth 4x32bit vector units) was to do as many float multiplies and adds as possible. The whole point of the PS3's infamous "Cell" processor was also to do as many float multiplies and adds as possible. The iPhone was pretty much the first cell phone with an FPU, and that's exactly when 3d games on cell phones exploded.
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27
Contact:

Re: Different instruction sets on different cores

Post by agner »

Hubert wrote:
Video games use TONS of floating point math
ForwardCom has optional support for half precision floating point vectors. Do you think that video and sound applications can use half precision? Neural networks are another application for half precision.

The operand type field in the instruction template has 3 bits, giving 2³ = 8 types: int8, int16, int32, int64, int128, float32, float64, float128.
As you see, I have given priority to possible 128-bit extensions, so there is no space for float16. Instead, half precision instructions are implemented as single-format instructions without a memory operand. You need to use int16 instructions for memory read and write.
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

HubertLamontagne wrote: 2018-02-01, 16:39:33
JoeDuarte wrote:I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core.
Video games use TONS of floating point math. Everything that is a 3d coordinate is going to have 32bit floating point XYZ. If your scene has 10 million vertexes, that's 30 million floating point values just for the coordinates (plus another 30 million for the normals), that need to be processed each frame (= 60 times per second). The whole point of the PS2's "Emotion engine" (customized MIPS + 2 high bandwidth 4x32bit vector units) was to do as many float multiplies and adds as possible. The whole point of the PS3's infamous "Cell" processor was also to do as many float multiplies and adds as possible. The Iphone was pretty much the first cell phone with an FPU and that's exactly when 3d games on cell phones exploded.
Would games be better off with logarithmic number system hardware instead of FP?
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

JoeDuarte wrote: 2018-01-22, 2:36:11
64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices
40-bit does sound interesting, and I can't really disagree with not needing any more, but I think in this day and age, not having 64-bit support may be a hard sell. A bunch of code and algorithms (e.g. off the top of my head, SHA512) are already optimised to make use of 64-bit CPUs.
1TB of addressable memory does sound like a potential limitation though. It's not unusual for servers to have this much memory these days, and it's likely it'll become common in clients in the future. Also, if non-volatile memory storage solutions become popular in the future, and OSes see a benefit in mapping disk into RAM, 1TB would definitely be limiting.


I was only thinking of 1 TB for clients, not servers. You might have noticed that RAM on desktops, laptops, and smartphones has virtually hit a wall. There's very little growth at this point. The iPhone has been stuck at 2 GB for years. Premium Android devices usually sport 4-6 GB. High-end laptops are still coming with 8, 12, or 16 GB (some have a theoretical max of 32 GB, and some, like Apple's useless port-free laptops, are capped at 16 GB).

I realize that Bill Gates supposedly made that infamous comment about how we'd never need more than 640 KB of RAM, but the fact that he was way off doesn't mean that there isn't actually a number that we'll never need to surpass. It looks like it will be many years before 64 GB is normal in a desktop/laptop, and I don't think it will ever be normal on mobile (unless we're talking the year 2100).

1 TB will be far more than enough for clients for several decades. The only thing I wonder about is how tagged memory will work with a 40-bit address space. Would it require more bits? I've been fascinated by the CHERI CPU project and ISA: http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

And the TAXI proposal: https://people.csail.mit.edu/hes/ROP/Pu ... thesis.pdf
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

I've never seen half-float being used. Not in game code, and not in sound applications (where 32bit float is very much the sweet spot). There's very little x86 support – only the F16C conversion instructions to and from 32bit float vectors (vcvtps2ph and vcvtph2ps). There is no standard C/C++ type name for it either (the only trace of half float on x86 is the conversion intrinsics).
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

HubertLamontagne wrote: 2018-02-02, 18:11:06 I've never seen half-float being used. Not in game code, and not in sound applications (where 32bit float is very much the sweet spot). There's very little x86 support - only AVX conversion instructions to and from 32bit float vectors (vcvtps2ph and vcvtph2ps). There is no standard C/C++ type name for it either (the only trace of half float on x86 is the AVX conversion intrinsics).
Hi Hubert, I've lost the plot here a bit. Why are you talking about half-float? Is this related to my enthusiasm for 40-bit registers and address spaces for client devices? How so?

In any case, half-float, by which I assume you mean 16-bit FP, is extremely relevant right now, much more so than it was even ten years ago. It features prominently in a lot of deep learning APIs and platforms, most recently in NVIDIA's new Volta "GPU" architecture with its plethora of dedicated tensor cores (I put "GPU" in quotes because this product is no more a GPU than my rooftop antenna – it's meant exclusively for data centers, particularly for deep learning applications. Perhaps one day the Volta architecture will be spun into a GPU, and one can even dream that cryptocurrency miners won't make it impossible to actually buy these "GPUs" for ≤ 120% of their MSRP.)

Some interesting explorations of the 16-bit FP renaissance:

https://devblogs.nvidia.com/mixed-preci ... ng-cuda-8/

Facebook's Caffe2 platform: https://caffe2.ai/blog/2017/05/10/caffe ... pport.html

Deep dive into Volta: https://devblogs.nvidia.com/inside-volta/

With my proposed 40-bit platform, I imagine specifying 20, 40, and 80-bit integers and FP. I think 20-bit integers and floats would be more useful in many cases than 16-bit. And the 80-bit floats perfectly sync up with the 80-bit Extended Precision FP that IEEE sort of documents already. I think Intel uses 80-bit floats when doing math on doubles. The 20, 40, and 80 bit floats would have to be very rigorously specified, much like the recent IEEE specs (but it should be free and open source, not cost an arm and a leg like the IEEE standards or the C++ standard).

There's also the new ISO/IEC standard which is much broader than floating point: https://en.wikipedia.org/wiki/ISO/IEC_10967

And I'd want a logarithmic number system IF the requisite empirical research tells us that it would be a significant benefit for many programs. (And yes, we'd have to sort out what we mean by "significant" and "many" and so forth.)

I assume a 20/40/80-bit platform could easily support legacy 16/32/64-bit types by padding or other means.

I also like the idea of 320-bit vector registers: 8 40-bit values, 10 32-bit values, or 4 80-bit values. From what I've read, I'm not sure that huge vectors of the sort Agner wants are efficient. Isn't AVX-512 underperforming right now?

Finally, I think core type bit lengths, register sizes, address space, vector length, etc. should all be chosen by rigorous empirical research on what is optimal for the kind of operating system we want (and we really should want new, clean-sheet OSes), and the applications we expect to run on them. My 20/40/80 business is really just a hunch of near optimality for client devices. But the optimal values could be quite different, and innovations in semiconductor manufacturing and hardware design could enable a whole new set of optimal parameters.
-.-
Posts: 5
Joined: 2017-12-24, 5:10:47

Re: Different instruction sets on different cores

Post by -.- »

JoeDuarte wrote: 2018-02-02, 13:38:18 You might have noticed that RAM on desktops, laptops, and smartphones has virtually hit a wall. There's very little growth at this point. The iPhone has been stuck at 2 GB for years. Premium Android devices usually sport 4-6 GB. High-end laptops are still coming with 8, 12, or 16 GB (some have a theoretical max of 32 GB, and some, like Apple's useless port-free laptops, are capped at 16 GB).
This does seem to be the case. I'd say there's not really a need for more RAM on most client machines. The other side is likely the RAM pricing in recent times due to supply shortages.
The iPhone probably has its own reasons for its limitations, and Intel limits its client CPUs to 32-64GB RAM, presumably to stop people running servers on them, so these may also be factors.

However, I suspect this "wall" is mostly an economic one, not a technical one. 128GB DIMMs are available now, and consumer motherboards with 4-8 DIMM slots are not uncommon. It's hard to guess economic conditions 10-20 years down the track, and I don't think an ISA should be making such heavy bets about it.
JoeDuarte wrote: 2018-02-02, 13:38:18 It looks like it will be many years before 64 GB is normal in a desktop/laptop
I'd agree with that, but I think an ISA should firstly consider the requirements of its high end users (since if it can serve those, it will also work for common users). I do know a few workstations (for multimedia processing) which have 64GB RAM installed.

My personal home computer has 32GB RAM installed. This guy's personal desktop has 128GB RAM installed [ https://jmvalin.dreamwidth.org/15583.html ].
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

Anyhow, the most common customizations you see on CPUs are for making smaller embedded versions. The most common configurations you see in the wild:

- 32bit, no MMU, no FPU, no SIMD: fast micro-controller. Truckloads of ARM and MIPS parts use this, such as the STM32s, which are taking over the hardware world.
- 32bit, FPU: fast micro-controller, very useful if you need lightweight systems that do DSP processing (ex: guitar effect pedals) (larger STM32s).
- 32bit, MMU: small CPU that runs complex OSes such as Linux. Lots of first-generation Android phones used this, plus routers etc.
- 32bit, MMU+FPU: the iPhone configuration. Does both complex OS and DSP. Classic configuration that has broad applicability to lots of software.
- 64bit, MMU+FPU+SIMD+virtualization: the previous config extended with 64bit to address large amounts of RAM, SIMD to boost DSP performance, and virtualization support.
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

agner wrote: 2018-02-02, 6:57:27 Hubert wrote:
Video games use TONS of floating point math
ForwardCom has optional support for half precision floating point vectors. Do you think that video and sound applications can use half precision? Neural networks is another application for half precision.

The operand type field in the instruction template has 3 bits giving 23 = 8 types: int8, int16, int32, int64, int128, float32, float64, float128.
As you see, I have given priority to possible 128-bit extensions so there is no space for float16. Instead, half precision instructions are implemented as single-format instructions without memory operand. You need to use int16 instructions for memory read and write.
Hi Agner, I think 16-bit FP has actually become more popular in recent years, so not treating it as a first-class citizen may be a mistake. It's not just popular in games, but in deep learning applications and imaging. GPU makers have intensified their support for it lately, and NVIDIA's new tensor cores center on it. Google's TensorFlow ASICs also depend on it, I think. Apparently 16-bit is optimal for deep learning because it offers the right compromise of precision and speed. Now, you could say all this stuff can be handled by GPUs, not a CPU instruction set, but there's evidence that 16-bit will be used a lot by CPUs, like the introduction of the x86 F16C conversion instructions, and the fact that 16-bit FP is used in some imaging formats for High Dynamic Range. Imaging won't always be offloaded to the GPU – in fact, right now it rarely is on desktop platforms. You can see some of the formats that depend on 16-bit here: https://en.wikipedia.org/wiki/Half-prec ... int_format

ImageMagick even releases special versions that support 16-bit per channel formats. I don't know if it's integer or FP, but I think it's the latter since they mention OpenEXR: http://imagemagick.org/script/download.php#windows
Post Reply