Using Forwardcom as a GPU?

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Using Forwardcom as a GPU?

Post by HubertLamontagne »

Considering how vector- and throughput-oriented Forwardcom is, I've been wondering if it would make sense as a GPU. It should be pretty good at vector processing at least. It might make sense to use tiled rendering. For rasterization, you'd use the full vector register size all the time, with register masking applied to each pixel operation (the mask being set when rasterizing the triangle contour and by the Z-test). A nice thing about Forwardcom is that it has a vectorized version of more or less all the base arithmetic operations and many broadly useful operations like bit reversal.
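
To make the per-pixel masking idea concrete, here is a minimal sketch in plain C++ (not ForwardCom code): the vector is modeled as a fixed array of LANES pixels and the register mask as a per-lane boolean, both hypothetical stand-ins for the real vector registers and masks. The mask is built from the triangle's edge functions and then narrowed by the Z-test before the masked store.

```cpp
// Sketch of masked rasterization over one row of a tile.
// LANES stands in for the hardware vector length; the bool array plays
// the role of a register mask (both are assumptions for illustration).
#include <cstdint>
#include <cstddef>

constexpr int LANES = 8;           // assumed vector width, in pixels
constexpr int TARGET_W = 1920;     // assumed render-target width

struct Edge { float a, b, c; };    // edge function: a*x + b*y + c >= 0 means "inside"

void shade_row(const Edge e[3], int y, int x0, int count, float tri_z,
               float* zbuf, uint32_t* color, uint32_t rgba)
{
    for (int x = x0; x < x0 + count; x += LANES) {
        bool mask[LANES];
        // 1. Coverage mask: a lane is live if its pixel centre is inside all three edges.
        for (int l = 0; l < LANES; ++l) {
            float px = float(x + l) + 0.5f, py = float(y) + 0.5f;
            mask[l] = (x + l < x0 + count) &&
                      (e[0].a * px + e[0].b * py + e[0].c >= 0.0f) &&
                      (e[1].a * px + e[1].b * py + e[1].c >= 0.0f) &&
                      (e[2].a * px + e[2].b * py + e[2].c >= 0.0f);
        }
        // 2. Z-test narrows the mask further; 3. masked store of depth and colour.
        for (int l = 0; l < LANES; ++l) {
            std::size_t i = std::size_t(y) * TARGET_W + std::size_t(x + l);
            if (mask[l] && tri_z < zbuf[i]) {   // flat depth per triangle, for brevity
                zbuf[i] = tri_z;
                color[i] = rgba;
            }
        }
    }
}
```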

The one GPU thing that looks like it would be a challenge is texture decompression, mapping and filtering. Larrabee did this with a separate texture mapping unit, and x86 CPUs got vbroadcast for this kind of use (not sure how it behaves in practice). AFAIK GPUs do it kind of like Larrabee did (with separate texturing units, and switching between threads to hide the latency of the texture load), with many intermediate-size GPUs loading 1 textured pixel per cycle per texture unit (on mobile GPUs and on the PS3, which had 8 texture units).
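
To illustrate why filtering is the awkward part, here is a single bilinear texture fetch in plain C++ (a hypothetical sketch, not how any particular GPU or ForwardCom implementation does it): every output pixel needs a gather of four texels at a data-dependent address plus the weighting arithmetic, which is exactly what dedicated texture units are built to hide.

```cpp
// One bilinear texture sample on the CPU (illustrative sketch).
#include <cstdint>

struct Texture {
    const uint8_t* texels;   // 8-bit single-channel texture, row-major
    int width, height;
};

// Clamp-to-edge addressing; wrap or mirror modes would also be possible.
static float fetch(const Texture& t, int x, int y) {
    if (x < 0) x = 0; else if (x >= t.width)  x = t.width - 1;
    if (y < 0) y = 0; else if (y >= t.height) y = t.height - 1;
    return t.texels[y * t.width + x] / 255.0f;
}

// u, v are in texel coordinates; returns the filtered value in [0,1].
float sample_bilinear(const Texture& t, float u, float v) {
    int   x0 = (int)u,  y0 = (int)v;       // integer texel coordinates
    float fx = u - x0,  fy = v - y0;       // fractional position inside the texel
    // The 4-texel gather: four loads at data-dependent addresses per pixel.
    float a = fetch(t, x0,     y0);
    float b = fetch(t, x0 + 1, y0);
    float c = fetch(t, x0,     y0 + 1);
    float d = fetch(t, x0 + 1, y0 + 1);
    // Weight horizontally, then vertically.
    return (a * (1 - fx) + b * fx) * (1 - fy) +
           (c * (1 - fx) + d * fx) * fy;
}
```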

Trying to use a CPU as a GPU is difficult but it's still an interesting design challenge. Any ideas or thoughts about this?
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Using Forwardcom as a GPU?

Post by agner »

I have never worked with graphics processing, but this is certainly worth doing. You could either use the FPGA feature to make custom instructions or make it hard-coded. Do the instructions have long latency?
Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Using Forwardcom as a GPU?

Post by Kulasko »

I'm actually interested in this as well.
I would like to build a system with CPU cores of different vector lengths, a scheduler that is aware of that, and a "driver" to do graphics acceleration. I would even be happy with 2D acceleration only.
It also would be especially interesting for heterogeneous workloads. One could have a high-power, high-clocked, low-latency core for the serial parts of a program and a low-power, low-clocked, high-latency but high-throughput core for the parallel parts. Assuming the cores share cache at some level, the operating system could reschedule a thread between the cores depending on the degree of data parallelism in the code. The HSA Foundation, which AMD pushed over the last 5-6 years, tried to do basically this, but with GPUs and other coprocessors. Considering the low overhead Forwardcom has for context switches, this could work fairly well.
TimID
Posts: 3
Joined: 2018-12-15, 21:20:17

Re: Using Forwardcom as a GPU?

Post by TimID »

It sounds like an interesting project, but I'm not sure it would be possible to make something competitive with dedicated GPUs. GPUs tend to get an order of magnitude more flops for the same part cost as a CPU by making an assumption about the type of work they will have to do. Whilst forwardcom has some really interesting ideas for streamlining and improving many parts of a CPU, GPUs tend not to have those parts at all.

To give a brief overview of the differences between a normal CPU and a GPU, the assumption that GPUs make is that:

The user will want to run thousands to millions(1) of instances of each program concurrently, and those programs will be relatively simple.

These conditions allow a very useful optimisation: you can remove all of the circuitry that reduces pipeline stalls in CPUs, because when a thread is stalled you can just switch it for one that isn't. In general there will be enough threads that you never run out of non-stalled ones (a toy scheduling loop sketched after the list below illustrates the principle). You then use all of the transistors that would have been dedicated to reducing stalls to do more maths. Some of the consequences of this are:

No branch prediction - if you don't know which branch to take, stall the thread and move to a new one
No OoO - if you don't have the data for an instruction yet, stall the thread and switch to a new one
No vector instructions(2) - we can't afford to rely on the programmer writing vectorizable code, so instead we run each thread as scalar code, group threads together (in groups of 32 - a "warp" - on NVIDIA hardware, and 64 on AMD's GCN), and run one scalar operation from each thread as a lane in a vector
Very fast context switches required - I believe it is possible to switch context on every single clock cycle on modern GPUs, although I don't know for sure
No stack - for context switches to be fast enough we can't afford to write the current context back to a stack, this means functions in GPU languages are always inlined by the compiler and recursion isn't possible
Very large register file - even writing the context to L1 would be too slow, instead we have to keep many contexts in the register file at once. On NVIDIA's Turing architecture the register file for each Processing Block(3) is 64KB and four of these blocks share a single 96KB L1 cache.
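
The "switch to another thread instead of stalling" pattern in the list above can be shown with a toy round-robin model in plain C++ (purely illustrative: the warp count and latencies are made-up numbers, and no real GPU scheduler works exactly like this). Each resident warp is either ready or waiting on a pretend memory load; every cycle the scheduler issues from the first ready warp, so the ALU only bubbles when every warp is stalled at once.

```cpp
// Toy model of latency hiding by switching between resident warps.
#include <cstdio>
#include <vector>

struct Warp {
    int pc = 0;            // next instruction for this warp
    int stall_cycles = 0;  // cycles until its pending load returns
};

int main() {
    std::vector<Warp> warps(8);                 // assumed 8 resident warps
    for (int cycle = 0; cycle < 40; ++cycle) {
        bool issued = false;
        for (Warp& w : warps) {
            if (w.stall_cycles > 0) {           // stalled: its load makes progress
                --w.stall_cycles;
                continue;
            }
            if (!issued) {                      // first ready warp gets the issue slot
                ++w.pc;
                if (w.pc % 4 == 0)              // pretend every 4th instruction is a
                    w.stall_cycles = 12;        // load with a 12-cycle latency
                issued = true;
            }
        }
        std::printf("cycle %2d: %s\n", cycle, issued ? "issue" : "bubble");
    }
    return 0;
}
```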

The other big difference is in main memory access: GPUs tend to have very wide accesses with long latencies. The current high-end GPUs use a 256-bit data bus and have image-processing ASICs built into the memory managers. They also store data in proprietary swizzled formats, although I don't know much about that because the manufacturers don't publish much about how they work. This takes advantage of the fact that we expect our accesses to be very coherent: if one pixel needs to know the colour of a texture at (5,3), there's a really good chance that the next pixel will need to know about (5,4).
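
As a concrete (and much simplified) example of a swizzled layout, here is Z-order / Morton indexing in plain C++; the real vendor formats are proprietary and more elaborate, so treat this as an assumption used for illustration only. Interleaving the x and y bits keeps texels that are neighbours in 2D far closer together in memory than a row-major layout does, which is what makes neighbouring fetches like (5,3) and (5,4) cache-friendly.

```cpp
// Z-order (Morton) texel addressing: interleave the bits of x and y.
#include <cstdint>
#include <cstdio>

// Spread the low 16 bits of v so that bit i moves to bit 2*i.
static uint32_t spread_bits(uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static uint32_t morton_offset(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

int main() {
    // Neighbouring texels land 22 positions apart in Z-order ...
    std::printf("Z-order: (5,3) -> %u, (5,4) -> %u\n",
                morton_offset(5, 3), morton_offset(5, 4));      // 27 and 49
    // ... versus a whole row (1024 texels) apart in row-major order.
    std::printf("row-major, 1024-wide: %u vs %u\n", 3u * 1024 + 5, 4u * 1024 + 5);
    return 0;
}
```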

I'm sure a forwardcom core could be designed to work in this environment, but I think it would be very different from the CPU version, and I think it would need a limited subset of the full instruction set. Like I said at the start, though, it would be an interesting project, and there might be other niches - highly parallel but not quite so restricted environments - where a forwardcom-based solution would be optimal.

1) For example working out what colour a given pixel should be: on a full HD display there are ~2 million pixels, so the same program runs two million times with different input co-ordinates.
2) There actually are vector instructions in high-level GPU languages, but these have at most four components and are aimed at simplifying geometry calculations. They are executed serially on a component-by-component basis.
3) Vaguely analogous to a CPU core, it has a vector FPU, a vector ALU (both 16 lanes of 32 bits), a scheduler, a register file, a pair of ASICs for neural network matrix operations and a set of load/store units. Four of these combined with an L1 cache, four of the image-processing ASICs I mentioned before and a ray-tracing ASIC make up a Streaming Multiprocessor, and 32-64 SMs make up a Turing architecture GPU, along with 6-8GB of GDDR memory and some ASICs for writing raster data into a memory buffer and others for turning that into a signal that monitors can understand.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using Forwardcom as a GPU?

Post by HubertLamontagne »

For sure. I have no illusions - such a project would likely turn out like the ill-fated Larrabee (the articles about its demise are confusing, but they seem to imply that it had something like half the performance of dedicated GPUs, with the drivers still in alpha stage as another generation of GPUs was coming in).

Like, maybe I'm wrong here, but to me, Forwardcom looks designed for heavily vector-oriented in-order cores. So applications like DSPs, AI and GPUs seem like they would be a natural fit, no? And isn't that more or less what Larrabee/Xeon Phi is? (with the added emphasis on a large number of cores, and heavy use of hyper-threading on each core to fill pipeline bubbles)

You'd definitely need a very high-bandwidth memory architecture to pull off the GPU thing, which I'm sure would require all sorts of tradeoffs. But Forwardcom already has this problem, with its emphasis on large vectors (although you'd also have to add texture units, which would definitely mess things up).

Forwardcom emphasizes large vector processing so much that I just thought it would make sense to look at heavily numerical applications, since they would probably be a natural fit.
TimID
Posts: 3
Joined: 2018-12-15, 21:20:17

Re: Using Forwardcom as a GPU?

Post by TimID »

Yeah, I definitely think a forwardcom Xeon Phi would be a good proposition. All of the things that are good about the Xeon Phi plus better vector processing and no need for the ridiculous x86 front end.

I'm also unable to get a clear picture of why Larrabee failed, other than that it didn't perform well enough. My best guess would be that the bandwidth needed for cache coherence was insurmountable. The Ryzen Threadripper CPUs seem to have solved this problem, but I haven't been keeping up with processors for a while so I have no idea how.

This thread has given me another interesting idea: could some GPU techniques be usefully applied to a forwardcom CPU? In particular, I wonder if having in-order cores with fast context switching to fill bubbles would be optimal for some niche, either by being cheaper or by using the die area of the OoO machinery for extra cores?
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using Forwardcom as a GPU?

Post by HubertLamontagne »

Yeah. The idea of a Forwardcom Xeon Phi / Larrabee makes sense to me, especially since it plays to Forwardcom's strengths (lots and lots of vector instructions). Clearly you could go the UltraSPARC way - make it in-order to keep the cores small, use really aggressive hyper-threading with lots of threads per real core to fill in all the pipeline bubbles, and put lots of cores per chip. Forwardcom has lots of powerful but complex vector instructions that would probably have high latency and multiple sub-steps (variable-sized vector loads/stores, etc.). You could easily deal with that by starting the complex operation and then switching to the next thread.

With Forwardcom's specialization around vector instructions and its optimization for throughput over latency, clearly it makes sense to target calculation-heavy applications like GPUs, 3D rendering, AI, simulations, DSP, etc. I tend to think that every CPU architecture needs a niche - be it mobile (ARM), heavily threaded servers (SPARC), high-power embedded such as consoles (MIPS, before ARM/POWER/x86 ate their lunch), tight DSP loops (DSPs), compatibility back to the Methuselah era (x86), tiny embedded controllers (PIC), throughput over everything else (GPUs), and so forth. And I think Forwardcom needs to look at what types of use it's targeting.
Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Using Forwardcom as a GPU?

Post by Kulasko »

I also agree with the idea of a heavily threaded in-order core with large vectors being a good fit for ForwardCom. Depending on the workload, one could even use the flexibility of variable size vectors and balance core count and vector size for an implementation.

However, I disagree with the thought of ForwardCom optimizing throughput over latency. ForwardCom certainly has many features for optimizing throughput, especially considering its vector support. However, in my opinion, these features seldom trade latency for it.

Instructions like restoring a vector (where the stored vector length has to be read as well as the data) might be an exception, but if you have short vectors and really want to optimize for latency, you can omit the variable length in your implementation and just load the maximum vector length, saving the second load.
As an example of latency optimization, ForwardCom instructions have at most one register result (potentially limiting information throughput), because that puts less stress on data-dependency management, enabling more aggressive out-of-order operation.

In my mind, ForwardCom is a fairly flexible, high-power, high-performance instruction set architecture. As it isn't meant to be highly specialized, its targets could range from anything that higher-end ARM, POWER and x86 already serve, to anything highly parallel like GPUs and some DSPs, and even to moderately specialized use cases, depending on how the FPGA part of the specification evolves. This flexibility, with specialized cores, software compatibility between them and fast context switches, could make ForwardCom especially interesting for heterogeneous workloads.
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Using Forwardcom as a GPU?

Post by JoeDuarte »

This reminds me that fast Bezier curve performance would be very useful, without having to worry about being a traditional GPU. Bezier curves are central to a lot of 2D rendering, including fonts and vector graphics like SVG. Some are quadratic and some are cubic.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using Forwardcom as a GPU?

Post by HubertLamontagne »

JoeDuarte wrote: 2020-04-29, 14:21:37 This reminds me that fast Bezier curve performance would be very useful, without having to worry about being a traditional GPU. Bezier curves are central to a lot of 2D rendering, including fonts and vector graphics like SVG. Some are quadratic and some are cubic.
I'm looking into this and it seems that those kinds of renderers typically just turn the curves into 2D polygons with a lot of points. Evaluating the curve into x/y points is already fairly efficient (and can be done with SIMD). The really tricky part seems to be rasterizing the shape from a long list of floating-point x/y coordinates into something like an anti-aliased pixel contour in an 8-bit greyscale bitmap or an oversampled 1-bit bitmap.
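
As a concrete illustration of the easy half (turning a curve into a point list), here is a minimal sketch in plain C++ using uniform subdivision; real renderers typically use adaptive flattening with an error tolerance, so treat the step count as an assumption. Each point evaluation is independent of the others, which is why this part maps well to SIMD; the hard part - scan-converting the resulting polygon into anti-aliased coverage - is not shown.

```cpp
// Flatten a cubic Bezier curve into a polyline by uniform evaluation.
#include <cstdio>
#include <vector>

struct Pt { float x, y; };

std::vector<Pt> flatten_cubic(Pt p0, Pt p1, Pt p2, Pt p3, int steps) {
    std::vector<Pt> out;
    out.reserve(steps + 1);
    for (int i = 0; i <= steps; ++i) {
        float t = float(i) / float(steps), u = 1.0f - t;
        // Bernstein form: B(t) = u^3*p0 + 3*u^2*t*p1 + 3*u*t^2*p2 + t^3*p3
        float b0 = u * u * u, b1 = 3.0f * u * u * t,
              b2 = 3.0f * u * t * t, b3 = t * t * t;
        out.push_back({ b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
                        b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y });
    }
    return out;
}

int main() {
    // Example: an arch-shaped curve flattened into 16 segments.
    for (const Pt& p : flatten_cubic({0, 0}, {0, 100}, {100, 100}, {100, 0}, 16))
        std::printf("%7.2f %7.2f\n", p.x, p.y);
    return 0;
}
```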

It's true that you can't really do this too efficiently on 3d accelerators (they only do triangles and they don't respond nicely to things like diagonal sliver polygons).
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Using Forwardcom as a GPU?

Post by JoeDuarte »

HubertLamontagne wrote: 2020-05-01, 22:58:36
JoeDuarte wrote: 2020-04-29, 14:21:37[...]
I'm looking into this and it seems that those kinds of renderers typically just turn the curves into 2D polygons with a lot of points. Evaluating the curve into x/y points is already fairly efficient (and can be done with SIMD). The really tricky part seems to be rasterizing the shape from a long list of floating-point x/y coordinates into something like an anti-aliased pixel contour in an 8-bit greyscale bitmap or an oversampled 1-bit bitmap.

It's true that you can't really do this too efficiently on 3d accelerators (they only do triangles and they don't respond nicely to things like diagonal sliver polygons).
Look up nvpath (Nvidia's NV_path_rendering OpenGL extension). It accelerates vector graphics on their GPUs/3D accelerators. It's a very interesting extension, and Adobe has used it to good effect in their Creative Cloud applications, probably Photoshop and others.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using Forwardcom as a GPU?

Post by HubertLamontagne »

JoeDuarte wrote: 2020-05-02, 12:55:17
HubertLamontagne wrote: 2020-05-01, 22:58:36[...]
Look up nvpath (Nvidia's NV_path_rendering OpenGL extension). It accelerates vector graphics on their GPUs/3D accelerators. It's a very interesting extension, and Adobe has used it to good effect in their Creative Cloud applications, probably Photoshop and others.
I've taken a look at it. It's kinda weird but it makes sense?

From what I can tell, it's still basically a fancy polygon renderer:
- It uses vertex shaders (or a close equivalent) to dice the curvy parts of the path into point lists
- For stroked paths, it then reprocesses the path into a closed polygon of some kind using a complex algorithm
- Then it draws the arbitrary polygon in 1-bit, enabling/disabling/toggling the stencil buffer for each drawn pixel
- It renders a couple of polygons covering the total area that was enabled in the stencil buffer (a rough CPU sketch of this stencil-then-cover step follows the list)
- For anti-aliasing, it simply uses the 3D renderer's multisample anti-aliasing
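
Here is a rough CPU analogue of the stencil-then-cover idea, in plain C++ and reduced to a single scanline with the even-odd fill rule (NV_path_rendering also supports the non-zero rule and does all of this on the GPU's stencil hardware with multisampling, so this is only meant to show the fill-rule logic, not the real implementation): step 1 marks a toggle wherever a path edge crosses the scanline, a prefix-XOR turns those marks into inside/outside flags, and step 2 "covers" by writing colour wherever the flag ended up set.

```cpp
// CPU sketch of "stencil, then cover" for one scanline, even-odd rule.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Pt { float x, y; };

void fill_scanline(const std::vector<Pt>& poly, int y, int width,
                   std::vector<uint8_t>& stencil, std::vector<uint32_t>& color)
{
    const float sy = float(y) + 0.5f;                 // sample at the pixel-centre row
    std::fill(stencil.begin(), stencil.begin() + width, 0);

    // Step 1 ("stencil"): toggle the stencil at every edge crossing of this scanline.
    for (std::size_t i = 0; i < poly.size(); ++i) {
        Pt a = poly[i], b = poly[(i + 1) % poly.size()];
        if ((a.y <= sy) != (b.y <= sy)) {             // the edge crosses the scanline
            float cx = a.x + (sy - a.y) / (b.y - a.y) * (b.x - a.x);
            int xi = std::max(0, (int)std::ceil(cx - 0.5f)); // first pixel centre right of the crossing
            if (xi < width) stencil[xi] ^= 1;
        }
    }
    // Prefix-XOR turns the crossing marks into inside/outside flags.
    for (int x = 1; x < width; ++x) stencil[x] ^= stencil[x - 1];

    // Step 2 ("cover"): write colour wherever the stencil is set.
    for (int x = 0; x < width; ++x)
        if (stencil[x]) color[std::size_t(y) * width + x] = 0xFFFFFFFFu;
}

int main() {
    const int W = 16, H = 8;
    std::vector<uint8_t>  stencil(W);
    std::vector<uint32_t> color(W * H, 0);
    std::vector<Pt> tri = { {2, 1}, {14, 1}, {8, 7} };    // a simple closed path
    for (int y = 0; y < H; ++y) {
        fill_scanline(tri, y, W, stencil, color);
        for (int x = 0; x < W; ++x) std::putchar(color[y * W + x] ? '#' : '.');
        std::putchar('\n');
    }
    return 0;
}
```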

It's kinda complex for what it does? Though it is true that there are many very popular applications that use path rendering (web browsers, "office" applications...).