forwardcom forum

HubertLamontagne

When I google for "shared virtual address model" I get something with CPU and GPU sharing the same virtual addresses, but still with fixed-size pages. I think there is little need for a GPU when the CPU has long vectors. Yeah I think "shared virtual address model" is used for tw...

HubertLamontagne

Using a specialized RISC core as a starting point for a GPU makes sense... For instance, ATI/AMD's GCN is basically a very specialized RISC (for reference, GCN1 came out in 2012, PS4 and Xbox One are GCN2, PS4 Pro and Xbox One X are GCN4). Here's the instruction set document for GCN1: http://developer...

HubertLamontagne

Seems that they are walking down the path of the Intel Larrabee eh? :) I imagine the texturing unit would be designed as 4 parallel data caches, and a bilinear-interpolated texture lookup would read the even-even, odd-even, even-odd, and odd-odd texels around the requested texture coordinate at t...
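To make the four-bank idea concrete, here is a minimal Python sketch. It assumes a texture with even width and height and wrap-around addressing, so that the four texels around any coordinate always land in four different parity banks. The names `split_into_banks` and `bilinear_lookup` are hypothetical, and this models only the addressing, not any real texturing unit.

```python
import math

def split_into_banks(texture):
    """Split a 2D texture (list of rows) into four banks keyed by (x parity, y parity)."""
    banks = {(px, py): {} for px in (0, 1) for py in (0, 1)}
    for y, row in enumerate(texture):
        for x, texel in enumerate(row):
            banks[(x & 1, y & 1)][(x >> 1, y >> 1)] = texel
    return banks

def bilinear_lookup(banks, width, height, u, v):
    """Bilinear sample at (u, v) in texel units, with wrap-around addressing.

    Assumes width and height are even, so x and x+1 (and y and y+1)
    always differ in parity even across the wrap.
    Returns (value, list of banks read).
    """
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    value = 0.0
    banks_read = []
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            x = (x0 + dx) % width
            y = (y0 + dy) % height
            bank = (x & 1, y & 1)          # which of the 4 "caches" serves this texel
            banks_read.append(bank)
            value += wx * wy * banks[bank][(x >> 1, y >> 1)]
    return value, banks_read

# Example: a 4x4 texture whose texel value is x + 10*y.
texture = [[x + 10 * y for x in range(4)] for y in range(4)]
banks = split_into_banks(texture)
value, banks_read = bilinear_lookup(banks, 4, 4, 1.5, 2.25)
```

Each of the four reads goes to a different bank, which is what would let four independent caches serve one bilinear lookup in parallel.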

HubertLamontagne

Still, I'm not convinced that this is a net decrease in complexity. This creates a small new memory area, with its own addressing scheme separate from the main RAM. Since it's not affected by RAM area mapping, it needs to be integrally copied to/from main RAM during context switches. That means you ...

HubertLamontagne

If you can do efficient bilinear texturing on a general purpose CPU, that would already be better than Intel (who had to add a texturing unit to the Larrabee) and Sony (who had to add a GPU to the PS3 when they realized that they couldn't efficiently texture polygons on the Cell). And it would prob...

HubertLamontagne

I imagine if it ever comes to that, with large 3d graphics adapters, by that point you'd probably memory-map the device space and use something like paging and chained DMA and bus mastering and even an IO-MMU (which is now a thing on modern PCs). The culmination of this trend is that on the PS4, whe...

HubertLamontagne

I was thinking today that I don't know why you care about not having page tables and a TLB. What's the problem? It takes hardware resources? So what? Is it really that big a problem to have page tables and TLBs, or is this more of an esthetic preference? One other thing to keep in mind is that you don...

HubertLamontagne

Is the logic for multi-register push and pop too specific to be shared? I thought that perhaps, the multi-register logic that already is needed because of this could be used as a more general multi-register instruction generator. That would enable multi-register instructions without much added comp...

HubertLamontagne

I guess this is about the limit of where I can help, because there are like half a dozen styles of pipelines (single-issue, Atom-style load-alu single issue, dual-issue, VLIW, simple OOO where every load/store/ALU op is a full independent micro-op, OOO with micro-op fusion, OOO with clustering), an...

HubertLamontagne

I'm very much reminded of the register-window stack on SPARC which has similar semantics, and stores return addresses as part of its mechanism: http://icps.u-strasbg.fr/people/loechner/public_html/enseignement/SPARC/sparcstack.html Any local variable that is stored on the stack and that isn't access...

HubertLamontagne

If tiny instructions are rare and are mostly used by register saving/loading to stack, and load/store multiple or load/store pair (the ARM64 equivalent) is simpler to implement, then it makes total sense to go for the multi-register loads/stores instead yeah. I think the extra complexity in the regi...

HubertLamontagne

ARM is pretty good at this (obviously it has to, since so much of their business is embedded cores), with the whole gamut: - Stripped down 32bit (modern small microcontrollers) - 32bit (lots of microcontrollers and smaller cores, GBA and NDS) - 32bit + FPU (used in some microcontrollers for DSP-heav...

HubertLamontagne

Without the ability to increase the number of mappings when needed, then you'd definitely need more RAM for sure, because you have a lot fewer available techniques to use when RAM gets tight: - Apps rarely malloc() all their ram in just one initial go. The pattern is more like a dynamic mix of mallo...

HubertLamontagne

Presumably it would work roughly as follows: - All allocation happens in 4k blocks. - When a new program starts, its initial allocation is set some distance away from other previous allocations (maybe with a 16mb offset?). - When your program first allocates more memory, the OS grows this initial al...

HubertLamontagne

[...] Look up nvpath. It's Nvidia's feature that accelerates vector graphics on their GPUs/3D accelerators. It's a very interesting extension, and Adobe has used it to good effect in their Creative Cloud applications, probably Photoshop and others. I've taken a look at it. It's kinda weird but it m...

HubertLamontagne

This reminds me that fast Bezier curve performance would be very useful, without having to worry about being a traditional GPU. Bezier curves are central to a lot of 2D rendering, including fonts and vector graphics like SVG. Some are quadratic and some are cubic. I'm looking into this and it seems...
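For reference, both the quadratic and the cubic case can be evaluated with the same repeated linear interpolation (de Casteljau's algorithm). The following is a minimal Python sketch of that evaluation only; the function name `de_casteljau` is chosen here for illustration, and nothing in it is specific to ForwardCom or to any GPU.

```python
def de_casteljau(points, t):
    """Evaluate a Bezier curve of any degree at parameter t (0 <= t <= 1).

    points: control points as (x, y) tuples; 3 points give a quadratic
    curve, 4 points give a cubic curve.
    """
    pts = [tuple(p) for p in points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(pts) > 1:
        pts = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]

# Quadratic curve (as used in TrueType outlines) and a cubic curve.
quad_mid = de_casteljau([(0, 0), (1, 2), (2, 0)], 0.5)
cubic_mid = de_casteljau([(0, 0), (0, 1), (1, 1), (1, 0)], 0.5)
```

Every step is a multiply-add over all coordinates at once, so evaluating many values of t maps naturally onto vector instructions.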

HubertLamontagne

Thank you for your replies. Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode. You are right, I didn't think about that. I still have some difficulty dealing with this as it complicates the decoding stage. Presumably, decoders don't have a constant late...

HubertLamontagne

Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode. ARM has many instructions that are multiple-uop (for instance, load+increment a pointer, multiple register store/load...). "Call" and "Ret" are inherently multiple-uop, since you load...

HubertLamontagne

Yeah. The idea of a Forwardcom Xeon-Phi / Larrabee makes sense to me, especially since it plays to Forwardcom's strengths (lots and lots of vector instructions). Clearly you could go with the UltraSparc way - make it in-order to make the cores small, use really aggressive hyper-threading with lots o...

HubertLamontagne

For sure. I have no illusions - such a project would be likely to turn out like the ill-fated Larrabee (the articles about its demise are confusing, but they seem to imply that it had something like half the perf of dedicated GPUs, with the drivers still in alpha stage as another generation of GPUs...

HubertLamontagne

Considering how vector and throughput-oriented Forwardcom is, I've been wondering if it would make sense as a GPU. It should be pretty good at vector processing at least. It might make sense to use tiled rendering. For rasterization, you'd use the full vector register size all the time, with registe...

HubertLamontagne

I'd imagine this would require special handling in the load/store unit if the vector isn't constant sized. Something where you get a kind of double-slot read micro-op or 2 micro-ops (there needs to be 2 reads if the image straddles a cache line boundary anyways), and if there's a potential read faul...

HubertLamontagne

You'd start with an in-order implementation in Verilog (or VHDL), I'd think, using block RAM instead of DRAM for instruction memory and data memory at first... and probably no vector support initially, and not too much pipelining at first. Then, you'd build up from there. You'd presumably start with...

HubertLamontagne

Presumably, the OS would need to use something like Buddy Memory Allocation system-wide to keep allocations contiguous as much as possible and to limit the number of mappings (and to be able to do multiple hundred megabyte allocations at all). Excessively large mappings that get swapped to disk woul...
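As a rough illustration of why buddy allocation keeps memory contiguous, here is a minimal Python sketch of the general technique. The class name `BuddyAllocator`, the 4 KiB minimum block and the 64 KiB pool are assumptions made for this example. It is a toy model of the bookkeeping, not a proposal for how a ForwardCom OS would actually do it.

```python
class BuddyAllocator:
    """Toy buddy allocator over a pool of total_size bytes.

    total_size and min_block are assumed to be powers of two.
    """

    def __init__(self, total_size, min_block=4096):
        self.total_size = total_size
        self.min_block = min_block
        self.free_lists = {total_size: {0}}   # block size -> set of free offsets
        self.allocated = {}                   # offset -> block size

    def alloc(self, size):
        """Return the offset of a block of at least `size` bytes, or None."""
        block = self.min_block
        while block < size:
            block *= 2                        # round up to a power of two
        cand = block
        while cand <= self.total_size and not self.free_lists.get(cand):
            cand *= 2                         # look for a larger free block
        if cand > self.total_size:
            return None
        offset = min(self.free_lists[cand])
        self.free_lists[cand].remove(offset)
        while cand > block:                   # split, keeping the lower half
            cand //= 2
            self.free_lists.setdefault(cand, set()).add(offset + cand)
        self.allocated[offset] = block
        return offset

    def release(self, offset):
        """Free a block and merge it with its buddy as long as possible."""
        size = self.allocated.pop(offset)
        while size < self.total_size:
            buddy = offset ^ size             # the buddy differs in exactly one bit
            peers = self.free_lists.get(size, set())
            if buddy not in peers:
                break
            peers.remove(buddy)
            offset = min(offset, buddy)
            size *= 2
        self.free_lists.setdefault(size, set()).add(offset)

pool = BuddyAllocator(64 * 1024)
a = pool.alloc(5000)      # rounded up to an 8 KiB block
b = pool.alloc(4096)      # carved out of the neighbouring 8 KiB block
pool.release(a)
pool.release(b)           # buddies merge back into one 64 KiB block
whole = pool.alloc(64 * 1024)
```

Because freed blocks merge with their buddies, the pool returns to a single large contiguous block once everything is released, which is the property that would help limit the number of mappings per process.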

HubertLamontagne

A string machine? How would you implement this string cache? Some kind of fast hardware hash function that processes 32 bytes at a time? Hardware accelerated UTF8 character loading and capitalization changes? In particular, the hardware assisted string bank updating sounds really hard to build in ...

Search found 80 matches

Re: Implications of ForwardCom memory management approach

Re: Using CPU cores as GPU

Re: Using CPU cores as GPU

Re: Separate call stack and data stack

Re: input/output instructions

Re: input/output instructions

Re: Implications of ForwardCom memory management approach

Re: Multi-register instructions

Re: Proposal to drop tiny instructions

Re: Separate call stack and data stack

Re: Proposal to drop tiny instructions

Re: Heterogenous cores / instruction sets

Re: Implications of ForwardCom memory management approach

Re: Implications of ForwardCom memory management approach

Re: Using Forwardcom as a GPU?

Re: Using Forwardcom as a GPU?

Re: Possible difficulties for microcode-less implementations

Re: Possible difficulties for microcode-less implementations

Re: Using Forwardcom as a GPU?

Re: Using Forwardcom as a GPU?

Using Forwardcom as a GPU?

Re: Possible difficulties for microcode-less implementations

Re: Putting it on real hardware

Re: Handling paging without a page system

Re: Interesting new ISA: MRISC32