Using CPU cores as GPU

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

Kulasko
Posts: 31
Joined: 2017-11-14, 21:41:53
Location: Germany

Using CPU cores as GPU

Post by Kulasko »

This is meant as more of a general topic for people who wish to comment or inform themselves about using CPU cores (or a CPU ISA) as a GPU accelerator.
If this can be done efficiently, ForwardCom should be a good fit for such a use case as it allows for very large vectors.

A company called Pixilica is currently developing an open source GPU based on RISC-V, with an instruction set extension for this very purpose.
I would love to hear the opinion of someone more well-versed with GPU architecture (and/or rendering APIs) about this.
https://www.pixilica.com/copy-of-home
agner
Site Admin
Posts: 178
Joined: 2017-10-15, 8:07:27

Re: Using CPU cores as GPU

Post by agner »

Thank you for the reference. I am not an expert in graphics processing, but I think that ForwardCom is well suited for adding graphics processing instructions. The fundamental data types and elementary instructions proposed by Pixilica are available in ForwardCom. ForwardCom has the further advantage of variable-length vectors and scalability.

There is no 24-bit data type in ForwardCom, but RGB pixels can be coded as three 8-bit elements in a vector.
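For example, a scan line of RGB888 pixels can simply be treated as a vector of bytes. The sketch below (plain C++, a hypothetical helper, not an existing ForwardCom library function) scales the brightness of an image where each color component is just one 8-bit element:

Code:

#include <cstdint>

// Sketch: scale the brightness of packed RGB888 pixels.
// Each color component is an independent 8-bit element, so a vector ISA can
// process a whole register full of components at once; the 24-bit pixel
// boundary never has to be visible to the instruction set.
void scale_brightness(uint8_t* rgb, int num_pixels, unsigned num, unsigned den) {
    for (int i = 0; i < 3 * num_pixels; ++i)        // 3 bytes per pixel
        rgb[i] = (uint8_t)((rgb[i] * num) / den);   // same operation for R, G and B
}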

ForwardCom supports half precision floats which may be useful in graphics and in neural networks.

Transcendental functions (log, exp, sin, etc.) with vector operands are implemented in a function library (only sin and cos are currently implemented, but the other functions are planned, using the same algorithms as in my Vector Class Library). This is more efficient than a microcode implementation. (The x87 instruction set has transcendental functions in microcode. These are inefficient and are today mostly replaced by software libraries.)
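As a rough illustration of the library approach, this is what the equivalent looks like with my Vector Class Library on x86 (just a C++ analogy, not ForwardCom code):

Code:

#include "vectorclass.h"        // Vector Class Library
#include "vectormath_trig.h"    // vectorized sin, cos, etc. as ordinary library code

// Apply sin() to an array, 8 floats per iteration, with no microcode involved.
void vector_sin(float* out, const float* in, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        Vec8f x;
        x.load(in + i);           // load 8 elements
        sin(x).store(out + i);    // polynomial approximation, entirely in software
    }
    // (handling of the remaining 0-7 elements omitted for brevity)
}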

The concept proposed by Pixilica seems to rely heavily on microcode. I have deliberately avoided microcode in ForwardCom because it is likely to be inefficient. I prefer to implement complex functionality either as state machines in hardware or as a series of simpler instructions. There is plenty of room for adding custom-designed instructions to ForwardCom, including graphics processing instructions.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using CPU cores as GPU

Post by HubertLamontagne »

Seems like they are walking down the path of the Intel Larrabee, eh? :)

I imagine the texturing unit would be designed as 4 parallel data caches, and a bilinear-interpolated texture lookup would read the even-even, odd-even, even-odd, and odd-odd texels around the requested texture coordinate at the same time, so you'd have a rate of 1 texel per cycle (some more recent versions expand this to 8 parallel data caches for trilinear interpolation). This is a common design and shows up as early as the Nintendo 64 (which had an 8kb "texture memory" that you had to fill explicitly), the Dreamcast (the texture compression is done in 2x2 chunks for a good reason...), the PS3 (still 1 texel per cycle, but 8 texturing units shared by 24 GPU units), and it is still used in mobile GPUs.
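To make the access pattern concrete, here is a scalar sketch of a bilinear texture fetch (my own illustration in C++, assuming a row-major RGBA8 texture). The four neighbours have x/y parities even-even, odd-even, even-odd and odd-odd, which is exactly what lets a 4-bank texture cache serve all of them in one cycle:

Code:

#include <algorithm>
#include <cmath>
#include <cstdint>

struct Texture {
    const uint8_t* texels;   // RGBA8, row-major
    int width, height;
};

// Fetch one texel with edge clamping (illustration only).
static void fetch(const Texture& t, int x, int y, float rgba[4]) {
    x = std::min(std::max(x, 0), t.width  - 1);
    y = std::min(std::max(y, 0), t.height - 1);
    const uint8_t* p = t.texels + 4 * (y * t.width + x);
    for (int c = 0; c < 4; ++c) rgba[c] = p[c];
}

// Bilinear filter: reads the 2x2 neighbourhood around (u, v) and blends.
void bilinear(const Texture& t, float u, float v, float out[4]) {
    float x = u - 0.5f, y = v - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;
    float c00[4], c10[4], c01[4], c11[4];
    fetch(t, x0,     y0,     c00);   // the four parity combinations of (x, y)
    fetch(t, x0 + 1, y0,     c10);
    fetch(t, x0,     y0 + 1, c01);
    fetch(t, x0 + 1, y0 + 1, c11);
    for (int c = 0; c < 4; ++c) {
        float top = c00[c] * (1 - fx) + c10[c] * fx;
        float bot = c01[c] * (1 - fx) + c11[c] * fx;
        out[c] = top * (1 - fy) + bot * fy;
    }
}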

This means that a RISC-V processor designed for this would implement some kind of hyper-threading, so that the processor can initiate a texture lookup and switch to another thread while the texture unit looks up the texels one by one. That probably explains why the vectorized register file has 1024 elements (32 threads x 32 registers per thread? Or fewer threads, with a block used to store polygon invariants?).

Having 136 bits per register probably means that it includes 4x32 bits of data and 8 masking bits... which makes it similar to the Larrabee too (afaik, SIMD masking in x86 was introduced with the Larrabee).
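Just to spell that arithmetic out as a guess (this layout is my assumption, not something taken from Pixilica's material):

Code:

#include <cstdint>

// Hypothetical register layout adding up to the 136 bits mentioned above:
// 4 lanes x 32 data bits = 128 bits, plus 8 mask/flag bits = 136 bits.
struct VectorReg136 {
    uint32_t lane[4];   // 4 x 32-bit data lanes          -> 128 bits
    uint8_t  mask;      // 8 mask bits (e.g. 2 per lane?) ->   8 bits
};                      // 136 architectural bits (ignoring C++ struct padding)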

The separate "Special Function Unit" is probably included so that you can run the transcendental functions in parallel with other computations... Every time something like cos() shows up in a shader, at 1920 x 1080 x 60fps that's already 125 million cos() executions per second, so it probably made sense to them to have a a specialized unit. AI has a similar problem, with tanh() often used to saturate results after a matrix multiply, which results in millions if not billions of tanh() calls per second. Tanh is used over other saturating funcitons because it is differentiable over an infinite range. Transcendental functions are often difficult to parallelize because you can't really use Look-up-tables without hammering the data cache (one access per vector item).

The built-in framebuffer is interesting... I guess they need it to deal with memory accesses that are relatively simple but extremely wide. For instance, if you have 32bpp pixels and a 32-bit Z-buffer and you're using 4x multisampling anti-aliasing and you're processing 4 pixels in parallel and you're drawing a transparent polygon, you're looking at reading and writing 1024 bits of data per execution (!), so I imagine some specialized RAM bandwidth is in order. Afaik, the Larrabee used tile rendering to tackle this problem - making the tiles small enough that the data cache acts as the framebuffer.
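To spell out that arithmetic: 4 pixels x 4 samples per pixel x (32-bit color + 32-bit Z) = 4 x 4 x 64 = 1024 bits to read, and another 1024 bits to write back once the transparent polygon has been blended in.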

The built-in rasterization hardware is interesting as well... On the Larrabee they didn't need it - they used a kind of hierarchical integer compare test per 4x4 pixel block instead... Though I imagine using a separate unit for this reduces complexity by removing the need for complex hierarchical rasterizing code in the rendering loop.
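For anyone unfamiliar with that technique, here is a rough sketch of edge-function rasterization over one 4x4 pixel block (my own illustration, not Larrabee's actual code) - it is nothing but integer multiplies, adds and compares:

Code:

#include <cstdint>

// Each triangle edge defines a half-plane test a*x + b*y + c >= 0;
// a pixel is inside the triangle when all three edge tests pass.
struct Edge { int32_t a, b, c; };

// Returns a 16-bit coverage mask, 1 bit per pixel of the 4x4 block at (bx, by).
uint16_t rasterize4x4(const Edge e[3], int bx, int by) {
    uint16_t coverage = 0;
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            bool inside = true;
            for (int i = 0; i < 3; ++i) {
                int32_t v = e[i].a * (bx + x) + e[i].b * (by + y) + e[i].c;
                if (v < 0) { inside = false; break; }
            }
            if (inside) coverage |= (uint16_t)(1u << (y * 4 + x));
        }
    }
    return coverage;   // a hierarchical version would first test whether the whole
}                      // block is trivially inside or outside all three edges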

I'm not sure what the microcode SRAM they included is for, but I imagine it's more for handling exceptional cases such as boot-up, pipeline transitions during interrupts and faults, MMU/TLB-miss handling, instruction cache flushing, specially ordered memory writes, etc... Generally operations that aren't speed-sensitive but involve messy micro-architecture details.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Using CPU cores as GPU

Post by HubertLamontagne »

Using a specialized RISC core as a starting point for a GPU makes sense... For instance, ATI/AMD's GCN is basically a very specialized RISC (for reference, GCN1 came out in 2012; the PS4 and Xbox One are GCN2, the PS4 Pro and Xbox One X are GCN4).

Here's the instruction set document for GCN1:
http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf

Some highlights:
- Very wide vectors (64 lanes! Each lane is 32-bit floating point)
- Intensive use of masking in vector operations with specialized 64-bit mask registers; vector ops are masked by default. It can branch depending on whether all lanes of a mask register are 0, or use the mask to conditionally set a special mode that turns vector operations into NOPs (see the sketch after this list).
- "Wavefronts" (irl = threads) are grouped together, presumably to keep together stuff that reads the same textures
- Intensive use of hyperthreading. The number of wavefronts per real CPU seems to be between 4 and 40 (4 groups of 1-10), with the scheduler cycling between all 4 groups.
- Very large and complex register files, including values such as hardwired constants
- Presence of a "local data store", which seems to be a specialized memory smaller than DRAM/L2 cache but larger than the register files.
- Floating point mode registers.
- Vector transcendental operations.
- Extremely complex load and store operations, both scalar and vectorized, using multiple paths (presumably for different frame/z/etc buffers, vector buffers, textures, etc), built-in texturing including interpolation and mip-mapping.
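Since the masking is the part that differs most from a normal CPU ISA, here is a tiny scalar model of a masked 64-lane operation (just an illustration of the idea, not GCN's actual semantics):

Code:

#include <cstdint>

constexpr int LANES = 64;

struct VReg { float lane[LANES]; };

// Masked vector add: only lanes whose bit is set in the execution mask are
// written; disabled lanes keep their previous value.
void vadd_masked(VReg& dst, const VReg& a, const VReg& b, uint64_t exec_mask) {
    for (int i = 0; i < LANES; ++i)
        if (exec_mask & (1ull << i))
            dst.lane[i] = a.lane[i] + b.lane[i];
}

// "Branch if all lanes are 0": with an empty mask the wavefront can skip a
// conditional block entirely instead of executing it as NOPs.
bool all_lanes_off(uint64_t exec_mask) { return exec_mask == 0; }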