Forwardcom and caching models

JoeDuarte · Post by **JoeDuarte** » 2018-05-19, 18:31:39

Hi Agner,

Is forwardcom hooked to any particular caching model or hierarchy? I see references to an instruction cache and data cache, but not much else.

Would forwardcom be fundamentally compatible with architectures like Rex Computing, which discards the traditional caching hierarchy in favor of a massive number of cores with 128 KB of scratchpad memory per core? http://rexcomputing.com/

Post by **agner** » 2018-05-20, 5:10:35

ForwardCom could use any caching model. Experiments with alternative forms of caching are welcome.

I don't think that 128 kB would be enough if you have large vector registers, but the cache might be subdivided into 'lanes' that align with the data lanes of the CPU. Quoting from the manual:

The ForwardCom design allows large microprocessors with very long vector registers. This requires special design considerations. The chip layout of vector processors is typically divided into “data lanes” so that the vertical transfer of data from a vector element to the corresponding vector element in another vector (i. e. same lane) is faster than the horizontal transfer of data from one vector element to another element at another position of the same vector (i. e. different lane). This means that instructions that transfer data horizontally across a vector, such as broadcast and permute instructions, may have longer latencies than other vector instructions.

In other words, I would not divide the chip into a large number of independent cores, but into 'data lanes' that all execute the same instruction, each on its own part of a long vector. A read or write of a full-length vector would access all the lanes in parallel, while the read or write of a scalar (single value) should be able to access any of the lanes of the cache system.
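To make the vertical/horizontal distinction concrete, here is a minimal toy model. It is only a sketch: the lane count, the elements per lane, and the helper names (`lane_of`, `vertical_add`, `cross_lane_moves`) are assumptions made for illustration and do not come from the ForwardCom manual.

```python
# Toy model of a long vector split across data lanes.
# NUM_LANES and ELEMS_PER_LANE are assumed values, chosen only for illustration.
NUM_LANES = 4
ELEMS_PER_LANE = 2


def lane_of(index):
    """Lane that holds the vector element at this position."""
    return index // ELEMS_PER_LANE


def vertical_add(a, b):
    """Element-wise add: every result needs inputs from its own lane only."""
    return [x + y for x, y in zip(a, b)]


def cross_lane_moves(permutation):
    """Count elements a permute has to move from one lane to another.

    permutation[i] is the source index for destination element i.
    """
    return sum(
        1
        for dst, src in enumerate(permutation)
        if lane_of(dst) != lane_of(src)
    )


n = NUM_LANES * ELEMS_PER_LANE
a = list(range(n))
b = [10] * n

total = vertical_add(a, b)             # stays inside each lane
broadcast = [0] * n                    # every element reads element 0
reverse = list(range(n - 1, -1, -1))   # element i reads element n-1-i

print(cross_lane_moves(broadcast), cross_lane_moves(reverse))  # 6 8
```

In this model the element-wise add never leaves a lane, while the broadcast and the reversal move most or all elements between lanes, which is where the longer latency mentioned in the manual quote would come from.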

HubertLamontagne · Post by **HubertLamontagne** » 2018-05-23, 15:22:50

JoeDuarte wrote: ↑2018-05-19, 18:31:39 Would forwardcom be fundamentally compatible with architectures like Rex Computing, which discards the traditional caching hierarchy in favor of a massive number of cores with 128 KB of scratchpad memory per core? http://rexcomputing.com/

That architecture reminds me a lot of the PS3's Cell processor, and looks a lot like a type of DSP. I'm not sure it makes sense to have an ISA that targets both that kind of system and general purpose CPUs; the difference is kinda wide and they're programmed differently (parallel calculation-oriented languages like OpenCL vs general purpose like C++ and high level programming languages).

marioxcc · Post by **marioxcc** » 2018-06-05, 15:02:47

The common wisdom is that common and simple tasks should be implemented in hardware to make it faster and more energy-efficient. Traditional caches are the consequence of applying this principle to memory management. Scratch-pad memory is software-managed cache. Vendors of cheap processors use it to shift the burden of cache management to software developers. For the same IC area, you can provide a bigger scratch-pad memory than a cache, but any benefit of the increase in size is lost to the overhead of managing cache in software. I think that even direct-mapped cache (which requires little logic beyond the SRAM array) is preferable to scratch-pad memory because although you still have to design software around memory access patterns, you avoid the overhead of explicit cache management (i.e.: moving to/from scratch-pad memory).
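As a rough illustration of how little logic a direct-mapped cache needs, the lookup can be sketched as a single index/tag comparison. The line size, the number of lines, and the class name `DirectMappedCache` are assumptions for this sketch, not parameters of any real processor.

```python
LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_LINES = 256   # 256 lines * 64 B = 16 KiB (assumed)


class DirectMappedCache:
    """Each memory line can live in exactly one slot, so a lookup is one compare."""

    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line_number = address // LINE_SIZE
        index = line_number % NUM_LINES   # which slot the line maps to
        tag = line_number // NUM_LINES    # identifies which line occupies the slot
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag            # evict whatever was in the slot
        self.misses += 1
        return False


# Sequential 8-byte reads over 4 KiB: one miss per line, the rest are hits.
cache = DirectMappedCache()
for addr in range(0, 4096, 8):
    cache.access(addr)
sequential = (cache.hits, cache.misses)

# Two addresses exactly one cache size apart fight over the same slot.
conflict = DirectMappedCache()
stride = LINE_SIZE * NUM_LINES
for _ in range(100):
    conflict.access(0)
    conflict.access(stride)
thrash = (conflict.hits, conflict.misses)

print(sequential, thrash)  # (448, 64) (0, 200)
```

The second loop shows the point about still having to design software around access patterns: the hardware handles the data movement, but a bad stride defeats it completely.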

JoeDuarte · Post by **JoeDuarte** » 2018-06-07, 7:07:20

HubertLamontagne wrote: ↑2018-05-23, 15:22:50
JoeDuarte wrote: ↑2018-05-19, 18:31:39 Would forwardcom be fundamentally compatible with architectures like Rex Computing, which discards the traditional caching hierarchy in favor of a massive number of cores with 128 KB of scratchpad memory per core? http://rexcomputing.com/
That architecture reminds me a lot of the PS3's Cell processor, and looks a lot like a type of DSP. I'm not sure it makes sense to have an ISA that targets both that kind of system and general purpose CPUs; the difference is kinda wide and they're programmed differently (parallel calculation-oriented languages like OpenCL vs general purpose like C++ and high level programming languages).

So the Cell just sort of faded away over time? It seemed like real innovation in hardware and ISAs, which is rare. By the way, the Itanium is looking sweeter than ever given its immunity to Meltdown and Spectre. I feel like the industry needs to move forward at a faster clip.

JoeDuarte · Post by **JoeDuarte** » 2018-06-07, 7:11:35

marioxcc wrote: ↑2018-06-05, 15:02:47 The common wisdom is that common and simple tasks should be implemented in hardware to make it faster and more energy-efficient. Traditional caches are the consequence of applying this principle to memory management. Scratch-pad memory is software-managed cache. Vendors of cheap processors use it to shift the burden of cache management to software developers. For the same IC area, you can provide a bigger scratch-pad memory than a cache, but any benefit of the increase in size is lost to the overhead of managing cache in software. I think that even direct-mapped cache (which requires little logic beyond the SRAM array) is preferable to scratch-pad memory because although you still have to design software around memory access patterns, you avoid the overhead of explicit cache management (i.e.: moving to/from scratch-pad memory).

What would it take to make our DDR4 bus go away and replace it with a giant L3 cache? Could we make L3 (or L4) a multi-GB thing? What about with stacked or adjacent memories like HBM2 or Hybrid Memory Cube?

HubertLamontagne · Post by **HubertLamontagne** » 2018-06-07, 22:08:15

JoeDuarte wrote: ↑2018-06-07, 7:07:20 So the Cell just sort of faded away over time? It seemed like real innovation in hardware and ISAs, which is rare.

The Cell hasn't faded away. Its major feature (multiple cores can be used for lots of throughput) has been applied to x86 and ARM, and 4+ core CPUs are now common. The major benefit is that your extra cores can now actually read/write from RAM for real, and can deal with any general purpose mishmash code instead of being limited to only vectorized numerical code.

HubertLamontagne · Post by **HubertLamontagne** » 2018-06-07, 23:40:51

JoeDuarte wrote: ↑2018-06-07, 7:07:20 By the way, the Itanium is looking sweeter than ever given its immunity to Meltdown and Spectre. I feel like the industry needs to move forward at faster clip.

For Itanium, I'm going to say something controversial here: it's actually a worse architecture than x86 (!).

Itanium's idea is that the compiler is going to schedule instructions beforehand, and that the silicon you're going to save will be used for more execution units. This just doesn't work in practice. Every time a program loads a value from memory, the compiler has to guess from a crystal ball if it's going to come from L1 cache (in which case it schedules dependent instructions close to minimize latency), or if it's going to come from L2+ cache or RAM (in which case it needs to buffer in more instructions between to avoid a stall). This decision has to be made by the compiler, is set in stone, and cannot vary dynamically as the program runs.
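The cost of a wrong guess can be put into a small model. This is only a sketch: the latencies are assumed round numbers, and `stall_cycles` and `added_latency` are hypothetical helper names, not anything from an Itanium toolchain.

```python
L1_LATENCY = 4    # cycles (assumed)
L2_LATENCY = 12   # cycles (assumed)


def stall_cycles(scheduled_gap, actual_latency):
    """Cycles an in-order pipeline waits when the compiler placed the consumer
    scheduled_gap cycles after the load but the data takes actual_latency."""
    return max(0, actual_latency - scheduled_gap)


def added_latency(scheduled_gap, actual_latency):
    """Cycles by which the dependent result is delayed beyond what the data
    required, because the compiler left more room than was needed."""
    return max(0, scheduled_gap - actual_latency)


# Compiler bets on L1, data comes from L2: the whole pipeline stalls.
print(stall_cycles(L1_LATENCY, L2_LATENCY))   # 8

# Compiler bets on L2, data comes from L1: no stall, but the result is late.
print(stall_cycles(L2_LATENCY, L1_LATENCY))   # 0
print(added_latency(L2_LATENCY, L1_LATENCY))  # 8
```

Either way one schedule is fixed at compile time, whereas an out-of-order core effectively picks the gap per execution.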

There's an extra problem: typical code will do a bunch of calculations, then store, then do a second load of some other data item for more calculations etc. To keep the CPU busy, the compiler has to move the second load before the store, but then it needs to add a safeguard that the second load doesn't fall on the same memory address as the store (otherwise you get all sorts of crashes). Because of this, Itanium needs a special second kind of load that locks the memory address, and a special store that checks for locks, and management instructions for the lock table.
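A heavily simplified simulation of that lock-table mechanism is sketched below. On Itanium the real pieces are the advanced load (`ld.a`), the check (`ld.c` / `chk.a`) and the Advanced Load Address Table; the Python class `AdvancedLoadTable` and the function `run` are illustrative stand-ins and gloss over table capacity, access sizes and partial overlaps.

```python
class AdvancedLoadTable:
    """Toy stand-in for the table that tracks hoisted loads."""

    def __init__(self):
        self.entries = {}   # register name -> address being watched

    def advanced_load(self, memory, reg, address):
        """Load early and remember which address the register came from."""
        self.entries[reg] = address
        return memory[address]

    def store(self, memory, address, value):
        """Store, and invalidate any watched entry for the same address."""
        memory[address] = value
        self.entries = {r: a for r, a in self.entries.items() if a != address}

    def check(self, memory, reg, address, value):
        """Keep the hoisted value if its entry survived, otherwise reload."""
        if self.entries.get(reg) == address:
            return value
        return memory[address]


def run(store_addr, load_addr):
    memory = {0: 1, 1: 2}
    table = AdvancedLoadTable()
    early = table.advanced_load(memory, "r1", load_addr)  # hoisted above the store
    table.store(memory, store_addr, 99)
    return table.check(memory, "r1", load_addr, early)


print(run(store_addr=0, load_addr=1))  # 2: no alias, hoisted value is kept
print(run(store_addr=1, load_addr=1))  # 99: alias detected, value is reloaded
```

Without the check, the second call would have returned the stale value 2, which is the kind of bug the extra load and store variants exist to prevent.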

A third problem is that maybe your second load needs to be hoisted before a conditional branch. Then you have the problem that the branch might be taken or not, and your load might never really happen in the program logic. This is not normally a problem, except that your hoisted load might trigger a page fault. So Itanium has yet a THIRD special kind of load that doesn't page fault directly, but instead loads a poison value if the load was going to fault. Then all your registers need an extra bit to indicate this poison value, plus management instructions for saving/restoring this (for interrupts etc).
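The poison-value idea can be sketched as follows. On Itanium the corresponding pieces are the speculative load (`ld.s`), the NaT ("Not a Thing") bit and the check (`chk.s`); the Python names `POISON`, `MAPPED`, `speculative_load`, `check` and `guarded` below are made up for illustration.

```python
POISON = object()         # stands in for a register whose poison bit is set

MAPPED = {0x1000: 42}     # toy address space: only this address is valid


def speculative_load(address):
    """Never faults: returns POISON where a normal load would have faulted."""
    return MAPPED.get(address, POISON)


def add(a, b):
    """Arithmetic propagates poison instead of raising."""
    if a is POISON or b is POISON:
        return POISON
    return a + b


def check(value):
    """The deferred fault only surfaces if the value is actually needed."""
    if value is POISON:
        raise MemoryError("deferred fault surfaced at check")
    return value


def guarded(pointer_valid, address):
    hoisted = add(speculative_load(address), 1)   # executed before the branch
    if pointer_valid:
        return check(hoisted)
    return None                                   # poison is silently dropped


print(guarded(True, 0x1000))    # 43
print(guarded(False, 0xDEAD))   # None: the bad load never faults
```

Calling `guarded(True, 0xDEAD)` raises, which corresponds to the fault being delivered at the point where the original program would have performed the load.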

A fourth problem is that now that you have long latency instruction series explicitly scheduled, you get the problem that loops have to be overlapped over more than one iteration (since it takes more time to get the results than to start the next iteration). Since the first and last iteration are partial, you'd end up with long prologues and epilogues, so to help with this, Itanium also has predication - basically, every instruction can be made conditional, so that you can easily run only parts of the loop the first and last time around. And since multiple overlapped iterations need to operate on different values, Itanium tacks on a register rotation engine, that can dynamically rotate the names of a bunch of registers and is also used to automatically spill/refill registers to the stack on function calls.
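The overlap of iterations and the role of predication can be sketched with a three-stage loop. This is a minimal model, assuming one cycle per stage; the function name `pipelined_double` and the predicate names are hypothetical, and the register rotation is replaced by two plain hand-off variables.

```python
def pipelined_double(src):
    """Compute out[i] = src[i] * 2 as a 3-stage software pipeline.

    Stage 1 loads, stage 2 multiplies, stage 3 stores. One loop body serves as
    prologue, kernel and epilogue; predicates switch stages on and off.
    """
    n = len(src)
    stages = 3
    out = [None] * n
    loaded = None    # hand-off from the load stage to the multiply stage
    product = None   # hand-off from the multiply stage to the store stage
    trace = []

    for step in range(n + stages - 1):
        p_load = step < n
        p_mul = 1 <= step < n + 1
        p_store = 2 <= step < n + 2

        # Later stages run first so each one consumes the value produced
        # during the previous step, before it is overwritten.
        if p_store:
            out[step - 2] = product
        if p_mul:
            product = loaded * 2
        if p_load:
            loaded = src[step]

        trace.append((p_load, p_mul, p_store))

    return out, trace


result, trace = pipelined_double([1, 2, 3, 4])
print(result)      # [2, 4, 6, 8]
print(trace[0])    # (True, False, False): only the load stage is active
print(trace[-1])   # (False, False, True): only the store stage is active
```

The first two and last two steps are the partial iterations; with predication they reuse the same loop body instead of needing separate prologue and epilogue code.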

---

If this is starting to sound complex, that's because it absolutely is. It's basically most of the machinery of an OOO core, but explicitly exposed to the programmer.

And if you try to implement Itanium in an Out of Order core, you run into the problem that all your instructions are conditional, all your instructions can generate an interrupt (because of the poison values), your load instructions don't only load but also poke bits here and there in this memory aliasing table, the register rename system can randomly issue a large blob of memory loads/stores, you have tons of short lived memory address values hammering your register file write ports since it doesn't even have the [register+immediate] addressing mode, plus it has a [register]+postincrement addressing mode which turns memory ops into two-result instructions (= probably has to be split into 2 micro-ops!), plus you have to deal with the possibility of rolling back the register renamer + the poison bits + the memory aliasing table if a branch goes the other way than predicted.

In other words, Itanium doesn't even have x86's saving grace: that you can throw an oversized instruction decoder and a flags register engine at it and mercifully split all the braindead 80's instructions into actually reasonable micro-ops. :3

JoeDuarte · Post by **JoeDuarte** » 2018-06-13, 1:37:00

Hubert, what happens if the second workload is performed on a separate core? For EPIC, I'd assume lots of cores.

Separately, does the Mill CPU solve some of these problems?

I'm hazy on the details -- didn't the idempotent processor also address these? I know it's not actually available as a product right now, but I'm interested in concepts too. (https://dl.acm.org/citation.cfm?id=2155637)

What I'd like to see is instant computing, where almost everything we do on a computer happens instantly. The exceptions would be things like video transcoding. The loads in a GUI OS would be big chunks of UI display, possibly pre-rasterized, and user-specific data. Given known SSD latencies, we should already have instant computing, where every application opens instantly, fully ready for use, but for some reason we don't get instant computing with existing OSes. (In fact, hard drive latencies seem low enough to deliver instant computing -- we should be able to get things in a fraction of a second.)

HubertLamontagne · Post by **HubertLamontagne** » 2018-06-15, 22:24:50

JoeDuarte wrote: ↑2018-06-13, 1:37:00 Hubert, what happens if the second workload is performed on a separate core? For EPIC, I'd assume lots of cores.

Compilers can't generally figure out how to move calculations to a second thread - they can't track all the potential side effects. This applies not only to C++, but also pretty much all higher level languages like Java and Javascript and whatnot.

JoeDuarte wrote: ↑2018-06-13, 1:37:00 Separately, does the Mill CPU solve some of these problems?

Dunno, it hasn't been released yet so we don't really know how well it performs. From the little I know, I don't really see how it would give good performance on something complex (ex: running a CPU emulator), but you never know.

JoeDuarte · Post by **JoeDuarte** » 2018-07-02, 21:49:35

HubertLamontagne wrote: ↑2018-06-15, 22:24:50
JoeDuarte wrote: ↑2018-06-13, 1:37:00 Hubert, what happens if the second workload is performed on a separate core? For EPIC, I'd assume lots of cores.
Compilers can't generally figure out how to move calculations to a second thread - they can't track all the potential side effects. This applies not only to C++, but also pretty much all higher level languages like Java and Javascript and whatnot.

But could compilers be made to have such abilities? Forget the past -- what about now?

One thing I'd like to see is the application of supercomputers and HPC clusters to compilation of normal applications -- desktop, server, mobile, games, etc. What could we achieve if we had massive computing resources dedicated to a compiler? It's strange to me that compilers are still limited desktop applications that we expect to run on any random laptop. I'd rather have a superoptimizing compiler, sort of like STOKE (http://stoke.stanford.edu/), running on a supercomputer in the cloud -- call it Compiler as a Service (CaaS).

Post by **agner** » 2018-07-03, 5:19:59

JoeDuarte wrote: ↑2018-07-02, 21:49:35 But could compilers be made to have such abilities? Forget the past -- what about now?

Yes. For example the Intel compiler can put prefetching of data into a separate thread running in the same CPU core with simultaneous multithreading (= hyperthreading in Intel lingo).

I generally don't like simultaneous multithreading because a low priority thread can steal resources from a high priority thread running in the same core. It is usually the operating system that distributes threads among cores, and the OS doesn't know what resources each thread needs, and what resources the CPU shares between threads.

Communication and synchronization between threads running in different cores is slow and complicated. Therefore it is not advantageous to split a job between multiple cores unless it can be divided into sufficiently big independent chunks.

There is no need to run the compiler on a supercomputer. Only the compiled code needs to run on a supercomputer to get the maximum performance.

HubertLamontagne · Post by **HubertLamontagne** » 2018-07-05, 3:17:45

JoeDuarte wrote: ↑2018-07-02, 21:49:35
HubertLamontagne wrote: ↑2018-06-15, 22:24:50 Compilers can't generally figure out how to move calculations to a second thread - they can't track all the potential side effects. This applies not only to C++, but also pretty much all higher level languages like Java and Javascript and whatnot.
But could compilers be made to have such abilities? Forget the past -- what about now?

There's a limit to the compiler's crystal ball to read into the future... If your code runs function A then function B, and both access the global state (through memory references or other devices) and the functions aren't simple enough that alias analysis can work, how can the compiler ever know that the functions will or won't access the same memory and must run serially or can be parallelized?
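A small example of why this matters, with made-up functions `scale` and `shift` standing in for "function A" and "function B": whether they can be reordered or run in parallel depends on whether their arguments refer to the same memory, which is a run-time property the compiler often cannot see.

```python
def scale(values):
    """Function A: doubles every element in place."""
    for i in range(len(values)):
        values[i] *= 2


def shift(values):
    """Function B: adds one to every element in place."""
    for i in range(len(values)):
        values[i] += 1


def run(first, second, a, b):
    """Call the two functions in the given order on the given buffers."""
    first(a)
    second(b)


# Distinct buffers: order is irrelevant, so the calls could run in parallel.
x1, y1 = [1, 2], [10, 20]
run(scale, shift, x1, y1)
x2, y2 = [1, 2], [10, 20]
run(shift, scale, y2, x2)   # same two calls, opposite order
print(x1 == x2 and y1 == y2)   # True

# Same buffer passed twice: order changes the result.
z1 = [1, 2]
run(scale, shift, z1, z1)   # (v * 2) + 1
z2 = [1, 2]
run(shift, scale, z2, z2)   # (v + 1) * 2
print(z1, z2)   # [3, 5] [4, 6]
```

Here the aliasing is obvious from the call site; in real programs the two references usually arrive through pointers, globals or object fields many calls away.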

marioxcc · Post by **marioxcc** » 2018-07-17, 16:11:38

JoeDuarte wrote: ↑2018-06-07, 7:11:35 What would it take to make our DDR4 bus go away and replace it with a giant L3 cache? Could we make L3 (or L4) a multi-GB thing? What about with stacked or adjacent memories like HBM2 or Hybrid Memory Cube?

It can’t be done. A “giant L3 cache” would no longer behave as you would expect from an L3 cache. The main feature of CPU cache is that it has very low latency. To be low latency, it uses SRAM and is small. If you make the L3 cache as big as main memory, it will be very slow (latency-wise, not necessarily bandwidth-wise) even if you keep using SRAM. On-package memory like HBM2 allows a higher bandwidth because you can have more interconnect wires, but the latency is only marginally better than off-chip DRAM.

The memory hierarchy is here to stay. As memory gets bigger, it also gets slower. So-called “3D fabrication” will reduce, but not eliminate this effect because it is a physical limitation: The bigger the memory, the more space it must use, and therefore the bigger the distance that has to be traveled to access the farthest bit. As distance gets bigger, latency will necessarily increase because of capacitance of wires and the finite propagation speed (which is significantly lower than the speed of light in both copper and optical cables; only wireless networks operate at practically the speed of light, but they have problems of their own).
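A back-of-the-envelope calculation shows the scale of the propagation limit alone. The velocity factor of 0.5, the 10 cm distance and the 3 GHz clock are assumed round numbers, and the helper names are made up; the estimate ignores wire capacitance and the access time of the memory array itself, both of which only make things worse.

```python
SPEED_OF_LIGHT = 299_792_458.0   # m/s, in vacuum


def round_trip_ns(distance_m, velocity_factor):
    """Time for a request to travel out and the data to travel back."""
    seconds = 2 * distance_m / (SPEED_OF_LIGHT * velocity_factor)
    return seconds * 1e9


def in_cycles(nanoseconds, clock_ghz):
    """Convert a time to clock cycles at the given frequency."""
    return nanoseconds * clock_ghz


delay = round_trip_ns(distance_m=0.10, velocity_factor=0.5)
print(round(delay, 3))                   # 1.334 ns
print(round(in_cycles(delay, 3.0), 1))   # 4.0 cycles at 3 GHz
```

So under these assumptions a memory 10 cm away costs about as many cycles as a typical L1 hit before any bit has actually been read, which is why capacity and latency cannot both be maximized.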

JoeDuarte · Post by **JoeDuarte** » 2018-07-19, 6:30:51

HubertLamontagne wrote: ↑2018-07-05, 3:17:45
JoeDuarte wrote: ↑2018-07-02, 21:49:35
HubertLamontagne wrote: ↑2018-06-15, 22:24:50 Compilers can't generally figure out how to move calculations to a second thread - they can't track all the potential side effects. This applies not only to C++, but also pretty much all higher level languages like Java and Javascript and whatnot.
But could compilers be made to have such abilities? Forget the past -- what about now?
There's a limit to the compiler's crystal ball to read into the future... If your code runs function A then function B, and both access the global state (through memory references or other devices) and the functions aren't simple enough that alias analysis can work, how can the compiler ever know that the functions will or won't access the same memory and must run serially or can be parallelized?

Have you seen what STOKE can do? It doesn't address your specific cases, but it optimizes the hell out of code, which is why it's called a SuperOptimizing Compiler. We're leaving a huge number of optimizations on the table with current C/C++ compilers. Hardly anyone uses LTO properly, or PGO, or lets the compiler run over the weekend... Compilers should be much smarter. For example, they should know that the application is, for example, a web server, and should know a few important things about web servers and their hot spots. With the right annotations in the code, we could get the benefits of PGO without having to run PGO. And there are a hundred other things an HPC cluster could do for us in a few minutes that would take gcc a week on an i7-8700K.

HubertLamontagne · Post by **HubertLamontagne** » 2018-07-20, 17:51:22

JoeDuarte wrote: ↑2018-07-19, 6:30:51 Have you seen what STOKE can do? It doesn't address your specific cases, but it optimizes the hell out of code, which is why it's called a SuperOptimizing Compiler.

I just took a look. It looks impressive for a research project. Though, from what I gather, it's geared towards optimizing numerical code, and Itanium is already good at numerical code. With some more research (adding the ability to process loops!) it could definitely be interesting for numerical processing applications such as signal processing code on VLIW DSPs. Maybe the approach could also work well for optimizing Verilog code for FPGAs.

JoeDuarte wrote: ↑2018-07-19, 6:30:51 We're leaving a huge number of optimizations on the table with current C/C++ compilers. Hardly anyone uses LTO properly, or PGO, or lets the compiler run over the weekend...

A lot of these optimizations don't change the speed :3

A modern CPU can do 4 instructions per cycle, and typical CPU intensive code runs somewhere between 1..4 instructions per cycle. The overwhelmingly most common reason that code isn't reaching 4 instructions per cycle is memory and cache bottlenecks. If you're in this (very common) situation, it doesn't matter how many combinations of bitshifts and SSE instructions the compiler tries, because you're still going to wait the same amount of time for memory loads/stores to do their job.
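A crude model makes the argument quantitative. It assumes that cache misses are not overlapped with anything else (pessimistic for an out-of-order core), and the instruction count, miss count and 200-cycle penalty are invented numbers; `total_cycles` and `effective_ipc` are hypothetical helpers.

```python
def total_cycles(instructions, peak_ipc, misses, miss_penalty):
    """Execution time as compute cycles plus fully exposed miss cycles."""
    return instructions / peak_ipc + misses * miss_penalty


def effective_ipc(instructions, peak_ipc, misses, miss_penalty):
    """Instructions per cycle actually achieved under the model above."""
    return instructions / total_cycles(instructions, peak_ipc, misses, miss_penalty)


# 1000 instructions on a 4-wide core with 2 misses of 200 cycles each.
baseline = total_cycles(1000, 4, 2, 200)   # 250 + 400 = 650 cycles

# A superoptimizer halves the compute part (modeled as doubling peak IPC).
better_code = total_cycles(1000, 8, 2, 200)   # 125 + 400 = 525 cycles

# Fixing the memory access pattern instead removes both misses.
no_misses = total_cycles(1000, 4, 0, 200)   # 250 cycles

print(round(effective_ipc(1000, 4, 2, 200), 2))   # 1.54
print(round(baseline / better_code, 2))           # 1.24
print(round(baseline / no_misses, 2))             # 2.6
```

In this example, halving the compute work buys about 1.24x, while removing two cache misses buys 2.6x.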

A lot of the reason Itanium failed is that the architecture just isn't good at all at dealing with this kind of memory/cache bottleneck. On x86, the compiler can simply leave the memory loads/stores in program order and hope for the best. To get the same performance on Itanium, the compiler has to explicitly turn some loads/stores into speculative loads and advanced loads - with the result that for load/store heavy programs, if the compiler guesses right you merely get the same performance as x86, and if the compiler guesses wrong you could get a lot worse.

(That being said, you're right that more people could use LTO on C++ for final builds.)

JoeDuarte wrote: ↑2018-07-19, 6:30:51 Compilers should be much smarter. For example, they should know that the application is, for example, a web server, and should know a few important things about web servers and their hot spots. With the right annotations in the code, we could get the benefits of PGO without having to run PGO. And there are a hundred other things an HPC cluster could do for us in a few minutes that would take gcc a week on an i7-8700K.

This is exactly what is IMPOSSIBLE to do in a compiler!

Compilers are great at LOCAL optimization, and can suss out a zillion valid reconfigurations of a small localized piece of code that doesn't touch too much memory. But once the medium- and large-scale picture comes in, it quickly becomes IMPOSSIBLE for the compiler to reason about what can or can't happen. It's not that we aren't trying or aren't throwing enough CPU compile time at the problem - it just can't be done!
