Implications of ForwardCom memory management approach

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Implications of ForwardCom memory management approach

Post by JoeDuarte »

Hi all — What are the software development implications of the ForwardCom approach to memory management?

The approach is summarized here: https://forwardcom.info/#memory_management

There was a thread on Agner's blog, but I can't find it.

What stood out to me: the stack size being calculated in advance (by the compiler), the heap size that "may be predicted by statistical methods", no memory pages or lookup tables, and no DLLs or shared objects (.so binaries in the Unix/Linux world).

Will this mean that devices will need more RAM?

How would it impact software development? I guess I'm wondering about both application development and OS development.

Thanks.
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

These methods are intended to reduce memory fragmentation. You don't need more RAM.
Current systems have large page tables and a translation lookaside buffer (TLB). These page tables take a lot of hardware and software resources. If you don't have memory fragmentation, you don't need page tables. You can make do with a small on-chip memory map with variable-size memory blocks. In simple cases you don't even need virtual addresses because all addresses are relative. This is my idea - it needs to be tested.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Implications of ForwardCom memory management approach

Post by HubertLamontagne »

Presumably it would work roughly as follows (a rough sketch in code follows the list):

- All allocation happens in 4k blocks.
- When a new program starts, its initial allocation is set some distance away from other previous allocations (maybe with a 16 MB offset?).
- When your program first allocates more memory, the OS grows this initial allocation in either direction.
- When your program deallocates memory, this creates a gap in the memory layout. The OS records this gap and keeps it in an intermediate state that is neither allocated nor free.
- After that, when your program allocates memory, if there is a gap that fits, it will allocate from the gap instead of growing the initial allocation. Ideally this eliminates the gap, or at least reduces its size. (I'm not sure which heuristic is better: filling the smallest gap that fits or the largest.)
- If the initial memory mapping runs out of space, then the OS creates a second memory area for your program further away in memory (this creates a new mapping).
- If the OS runs out of memory, then it looks at the gap list and finds the largest available gap. If that gap belongs to your program, then it will split the memory mapping of your program into 2 areas in order to reallocate the gap (this creates a new mapping).
- If your program allocates memory and the OS cannot find a free area that is large enough, then it will combine many smaller remaining gaps in order to be able to satisfy the required allocation (this creates potentially many new mappings).
- If the OS runs out of mappings for your program, then this scheme becomes a software TLB: the OS fits as many mappings as it can. If your program gets a memory protection fault, the OS first checks if the access falls on one of the mappings that couldn't fit. If so, then the OS temporarily evicts one of the less used memory mappings, swaps in the requested mapping and resumes program execution. This way, the system keeps running even in the event that memory gets totally fragmented, you just get more and more TLB swapping interrupts.
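
Here is a minimal sketch of the bookkeeping this implies. The 4k granularity and the best-fit heuristic are just assumptions I made for the example; all names are my own invention, not ForwardCom specifics:

    #include <cstdint>
    #include <vector>

    // Hypothetical OS-side bookkeeping for one process: a contiguous main
    // area that can grow, plus a list of gaps left by deallocations.
    struct Gap { uint64_t start, size; };

    struct ProcessArea {
        uint64_t base  = 0;      // start of the initial allocation
        uint64_t limit = 0;      // one past the end of the initial allocation
        std::vector<Gap> gaps;   // freed ranges, neither in use nor returned to the OS

        static constexpr uint64_t kBlock = 4096;  // all allocation happens in 4k blocks

        // Allocate 'size' bytes: prefer filling a gap, otherwise grow the area.
        uint64_t allocate(uint64_t size) {
            size = (size + kBlock - 1) & ~(kBlock - 1);   // round up to a 4k multiple
            // Best-fit heuristic: the smallest gap that is large enough.
            auto best = gaps.end();
            for (auto it = gaps.begin(); it != gaps.end(); ++it)
                if (it->size >= size && (best == gaps.end() || it->size < best->size))
                    best = it;
            if (best != gaps.end()) {
                uint64_t addr = best->start;
                best->start += size;
                best->size  -= size;
                if (best->size == 0) gaps.erase(best);    // the gap is eliminated
                return addr;
            }
            uint64_t addr = limit;                        // no suitable gap: grow the area
            limit += size;
            return addr;
        }

        // Deallocate: record the range as a gap instead of returning it to the OS.
        void deallocate(uint64_t start, uint64_t size) {
            gaps.push_back({start, size});
        }
    };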
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

Current systems allocate memory in blocks of 4kB each - sometimes bigger. I want to avoid the fixed-size blocks. With 4GB of RAM you will have a million 4kB blocks. Such a huge page table may be too big to fit on the chip. You may have nested page tables and other complexities. In current systems, a single running program may access a lot of DLLs or shared objects scattered around in memory, each with a shared code segment and a non-shared data segment. These systems have fixed-size memory blocks in order to make the page lookup process efficient.

In my system, there are no DLLs or shared objects scattered all over the RAM. Instead, the library functions are linked with each executable program in a way that makes them contiguous with the executable. This means that each task or thread needs to access only a few memory blocks: executable code, read-only memory, and read-write memory. There may be multiple read-write memory blocks if an ill-designed program keeps allocating more heap rather than allocating a sufficient heap at once. But in most cases, a thread will have so few memory blocks that it needs no more than 3 - 5 entries in the memory map. This makes it possible to have a few variable-size entries rather than a huge number of fixed-size entries in the memory map. This is the whole idea. If the program needs 4MB of data RAM, you will have a single entry in the memory map indicating 4MB or any other arbitrary size, rather than a thousand 4kB entries. You can have a stack that grows downwards, a fixed-size static memory segment, and a heap that grows upwards in the same memory map entry. The memory map lookup process with, say, five variable-size memory blocks can be implemented efficiently on-chip with just a handful of comparators. The OS may have many memory blocks to keep track of if many processes are running, but each thread needs only a small memory map that fits on the chip.
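
To make the lookup concrete, here is a rough software model of what the on-chip comparators would do. The names and field layout are only illustrative:

    #include <cstdint>

    // One variable-size entry in the on-chip memory map (illustrative layout).
    struct MapEntry {
        uint64_t start;        // first address covered by this block
        uint64_t end;          // one past the last address covered
        uint32_t permissions;  // read / write / execute bits
    };

    // A thread typically needs only a few entries: executable code, read-only
    // data, and one read-write block holding static data, stack and heap.
    constexpr int kMaxEntries = 5;

    // In hardware this is a handful of parallel comparators; in software it is
    // just a short loop over at most kMaxEntries variable-size entries.
    const MapEntry* lookup(const MapEntry map[], int count, uint64_t addr) {
        for (int i = 0; i < count && i < kMaxEntries; ++i)
            if (addr >= map[i].start && addr < map[i].end)
                return &map[i];
        return nullptr;        // address outside all blocks -> protection fault
    }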

I have tried to explain this in chapter 9.2 of the manual, but maybe the explanation needs to be made more clear.

I have made a linker that works as explained here. It is part of the Forw program, with source code on GitHub. You can even re-link an executable file if some library functions need to be updated.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Implications of ForwardCom memory management approach

Post by HubertLamontagne »

Without the ability to increase the number of mappings when needed, you'd definitely need more RAM, because you have a lot fewer techniques available when RAM gets tight:

- Apps rarely malloc() all their RAM in one initial go. The pattern is more like a dynamic mix of malloc() and free() of various sizes, most happening while loading a file/level/plugin/page/etc, but which can really happen at any moment in any order. There's no guarantee on how fragmented the heap gets. For instance, when you close a tab in a web browser, there's generally a gaping hole left in the middle of the memory map. When RAM gets tight, a paged OS simply allocates all the leftover gaps in the heap to another program. But if you can't increase the number of memory mappings past a few, your only solution here is to buy more RAM.

- Paged OSes use demand paging, where memory that has been allocated but never used (e.g. a JVM's large initial allocation) can be repurposed. This is used for the memory allocated for the stack, among other things. It delays the point at which memory gets tight. Obviously a system with a small number of mappings cannot use this technique, so you'd have to buy more RAM instead.

- The slew of memory-mapping tricks used in paged OSes to handle disk caching, memory-mapped files, and IO is likely to be difficult to implement without paging.

- When RAM runs out, a paged OS can dynamically select which pages to swap. It can swap out some of a program's pages, then let the program run and swap the pages back in to figure out which memory pages are really needed and which pages are rarely used. This kind of strategy is likely impossible on a low-number-of-mappings system.

- If RAM gets tight, a paged OS can look at all the executable data that it knows was loaded from disk and hasn't been modified, and evict those pages from RAM. If the program later runs code from those pages, it can reload the data from disk. This also cannot be done in a low-number-of-mappings system, and you'd have to compensate with more RAM.
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

Maybe I should be more clear about local heap and global heap. The local heap is owned and managed by the program. The global heap belongs to the OS.

A well-behaved program will ask the OS to allocate a big chunk of memory from the global heap and use it for its local heap. If the program has linked lists or variable-size strings, it will generate a lot of traffic on its local heap. But it still needs only one memory map entry as long as the local heap doesn't overflow. The worst case is when the heap requirement of a program is unpredictable and varies a lot, so that it may ask the OS for more memory.
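
As an illustration, the local heap can be as simple as a bump allocator running inside one block obtained from the OS. This is only a sketch; the OS call name is made up:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical OS call: reserve one contiguous block from the global heap.
    // The name and signature are made up for illustration.
    extern "C" void* os_alloc_global(size_t bytes);

    // Trivial local heap: a bump allocator inside the single block. A real
    // program would run a full malloc implementation on the same block.
    struct LocalHeap {
        char*  base = nullptr;
        size_t size = 0;
        size_t used = 0;

        bool init(size_t bytes) {
            base = static_cast<char*>(os_alloc_global(bytes));
            size = bytes;
            return base != nullptr;
        }

        void* alloc(size_t bytes) {
            bytes = (bytes + 15) & ~size_t(15);       // keep 16-byte alignment
            if (used + bytes > size) return nullptr;  // local heap overflow: only now
                                                      // would we ask the OS for more,
                                                      // costing a new memory map entry
            void* p = base + used;
            used += bytes;
            return p;
        }
    };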
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Implications of ForwardCom memory management approach

Post by JoeDuarte »

Apropos of this discussion, Microsoft is rolling out a new segment heap architecture that seems similar in some respects: https://www.blackhat.com/docs/us-16/mat ... ernals.pdf
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Implications of ForwardCom memory management approach

Post by JoeDuarte »

I was thinking today that I don't know why you care about not having page tables and a TLB. What's the problem? It takes hardware resources? So what? Is it really that big a problem to have page tables and TLBs, or is this more of an aesthetic preference?

One other thing to keep in mind is that you don't need to have 4 KB pages. That's just an old historical standard. There's research on increasing default system page size to 8, 12, and 16 KB, all of which increase application performance. And it would help simplify the TLB. I think 16 KB pages were supposed to be the sweet spot, but I can't find the paper.

Note that the RISC-V project initially decided to use 8 KB pages, but they switched back to 4 KB for compatibility reasons. They're basically a PastCom architecture, thoroughly rooted in the past...
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

A TLB lookup may take 1-2 clock cycles. A TLB miss costs many clock cycles. The chip area used for the TLB could be used for making a larger data cache or code cache instead. You can't have unlimited chip area without slowing down everything.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Implications of ForwardCom memory management approach

Post by HubertLamontagne »

JoeDuarte wrote: 2020-06-15, 12:50:46 I was thinking today that I don't know why you care about not having page tables and a TLB. What's the problem? It takes hardware resources? So what? Is it really that big a problem to have page tables and TLBs, or is this more of an aesthetic preference?

One other thing to keep in mind is that you don't need to have 4 KB pages. That's just an old historical standard. There's research on increasing default system page size to 8, 12, and 16 KB, all of which increase application performance. And it would help simplify the TLB. I think 16 KB pages were supposed to be the sweet spot, but I can't find the paper.

Note that the RISC-V project initially decided to use 8 KB pages, but they switched back to 4 KB for compatibility reasons. They're basically a PastCom architecture, thoroughly rooted in the past...
I guess it's in line with how ForwardCom is supposed to be a break with existing stuff.

If you add a page table, and stuff like hardware timers, DMA and interrupts, that's basically the path to implementing Linux (or other similar OSes like BSD) and kind of becoming another platform for a Chromebook or similar existing computer systems. Add in a GPU and you have another one of those SoCs that run Linux, like another Raspberry Pi... extremely cost-effective and very nice in many ways, but not exactly an evolution in system design - you're essentially just building a PC in a much smaller form factor.

By taking out the page table, you're taking out the possibility of running Linux (and Windows and OSX and BSD), so you kind of have to come up with something different to get out of that corner... Maybe that's the intended effect? At least, that seems to be the intent here, right?
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

You may have to emulate fixed size page tables to run Linux etc. Or you may want to invent something new.

ForwardCom can make memory blocks that are executable but not readable to make a true Harvard architecture, and memory blocks that are readable but not executable for security reasons. It can also make private memory spaces for each thread. There are many possibilities for inventing new, more efficient memory models.

It depends a lot on the application. The first applications are likely to be single-purpose SoC systems with no OS.

It may be necessary to add a traditional TLB for purposes that require compatibility with existing OSes, but the intention is to experiment with something new that is likely to be more efficient.
Sebastian
Posts: 2
Joined: 2020-08-06, 7:01:43

Re: Implications of ForwardCom memory management approach

Post by Sebastian »

Have you considered a shared virtual address space model? That puts the TLB on the slow path (only on cache miss, since you'd cache virtual addresses). You could use your idea of a small number of variable size regions for protection only, since that still has to be fast.

EDIT: to clarify that last point, the protection VA ranges don't have to perfectly match the virtual memory mapping; they just have to be a superset. That avoids one of the biggest issues with a small number of variable-sized ranges. Because these variable-size "protection ranges" are not responsible for mapping virtual addresses to physical addresses, you can just pre-allocate a massive range (e.g. 1TB) for a process. Then you fill in virtual memory pages within that range as you go (with a typical page table, but "below" the cache so it can be slow). So you can allocate a page of physical memory at a time, and as long as the virtual page you map it to is within that big protection range you pre-allocated, you never have to touch the protection side. So you probably could get away with a very small number of these variable-sized protection ranges, and those are the only things you have to consult when reading from the cache.
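
A rough sketch of what I have in mind, purely illustrative: the protection check stays small and fast, while the page-table walk sits below the cache:

    #include <cstdint>

    // Fast path: a handful of variable-size protection ranges per process.
    // They only need to be a superset of what is actually mapped.
    struct ProtRange { uint64_t base, limit; uint32_t perms; };

    bool access_allowed(const ProtRange ranges[], int n,
                        uint64_t vaddr, uint32_t need) {
        for (int i = 0; i < n; ++i)                  // n is small, e.g. at most 5
            if (vaddr >= ranges[i].base && vaddr < ranges[i].limit)
                return (ranges[i].perms & need) == need;
        return false;                                // outside every range -> fault
    }

    // Slow path: ordinary page-table translation, consulted only on a last-level
    // cache miss, because the cache itself is indexed by virtual address.
    uint64_t translate_below_cache(uint64_t vaddr);  // placeholder for the page walk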
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

Sebastian, I am not sure I understand your idea. Why do you think memory protection is critical for speed, but address translation is not?
Sebastian
Posts: 2
Joined: 2020-08-06, 7:01:43

Re: Implications of ForwardCom memory management approach

Post by Sebastian »

The key is that you use a shared virtual address model. That means you're caching virtual addresses. So address translation only happens on last-level cache misses, making it less critical for performance (giving you more options for how to implement it in terms of power, area, etc.). Protection still needs to be fast, but protection seems easier to handle with the "small number of variable-sized" regions than address translation (e.g. as I mentioned, you could just reserve 1TB of VA space for protection without actually allocating memory pages for more than the application needs, giving you space to add more and more pages on demand while staying within the same protection "region").
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: Implications of ForwardCom memory management approach

Post by agner »

When I google for "shared virtual address model" I get something with CPU and GPU sharing the same virtual addresses, but still with fixed-size pages. I think there is little need for a GPU when the CPU has long vectors.

But address translation after the cache may be a very good idea. I would like to hear what you think about the following model:
  • The CPU has no TLB and no page tables, only a small memory map with variable-size memory blocks.
  • We are using various methods for reducing memory fragmentation as discussed in the ForwardCom documentation.
  • The cache is using virtual addresses. Each CPU core or process has its own cache and a large virtual address space.
  • Virtual address translation is needed only in problematic cases when running many processes with highly unpredictable memory requirements, or software and OS that makes heavy use of memory mapped files and devices.
  • Virtual address translation is inserted between cache and main RAM. Virtual address translation is done on cache misses and prefetching.
  • The virtual address translator could use fixed-size pages and huge page tables like current systems, but it would be easier to use variable-size pages so that we avoid the page size compromise and the huge page tables.
  • A virtual address translator with variable size pages could use a binary search algorithm or a binary tree or some similar method for address translation. This will take a few more clock cycles than a traditional TLB, but still less than the access time to main RAM.
Do you think this will work? What would be the best way to implement the virtual address translator?
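
For the binary search option, I imagine something like the following. This is only a sketch to make the idea concrete, nothing is decided:

    #include <cstdint>

    // One variable-size translation region, kept sorted by virtual start address.
    struct Region {
        uint64_t vstart;   // first virtual address covered
        uint64_t vend;     // one past the last virtual address covered
        uint64_t pstart;   // physical address corresponding to vstart
    };

    // Binary search over the regions: O(log n) comparisons, done only on cache
    // misses and prefetches, so the extra cycles hide behind the RAM access.
    bool translate(const Region regions[], int n, uint64_t vaddr, uint64_t* paddr) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (vaddr < regions[mid].vstart)
                hi = mid - 1;
            else if (vaddr >= regions[mid].vend)
                lo = mid + 1;
            else {
                *paddr = regions[mid].pstart + (vaddr - regions[mid].vstart);
                return true;
            }
        }
        return false;      // no mapping found: page in from disk or raise a fault
    }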

Communication between multiple threads in the same process may be a problem if they have separate caches.

As the trend in microprocessor technology goes towards an increasing number of CPU cores, we may be able to allocate one core to each of the critical processes or threads with its own cache so that task switches are rare. Most memory allocations take place when new processes or threads are started. An ill-behaved process that keeps re-allocating memory unpredictably will have inferior performance, but it will not disturb other processes. Hyperthreading (running two threads in one core) should not be used.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Implications of ForwardCom memory management approach

Post by HubertLamontagne »

agner wrote: 2021-03-02, 8:01:23 When I google for "shared virtual address model" I get something with CPU and GPU sharing the same virtual addresses, but still with fixed-size pages. I think there is little need for a GPU when the CPU has long vectors.
Yeah I think "shared virtual address model" is used for two different things: Virtually-indexed virtually-tagged caches (see below), and yes IOMMUs. IOMMUs are one of the many schemes used to communicate between peripherals and programs using memory-mapping. Off the top of my head in increasing order of complexity:

- The OS reserves an IO buffer in contiguous memory. Whenever programs interact with the driver, the OS copies data to/from program memory (mapped) to this buffer (linear). This is fine with devices that have low IO bandwidths.

- The OS reserves a linear address range for the device, but it exposes this address range through memory mapping to programs. (For instance, Windows would fake a large linear framebuffer for SVGA cards that used 64 pages this way - accessing VRAM on a different page would cause a page fault and the OS would change the mapping accordingly.)

- Instead of using linear DMA, the device uses chain-DMA, and a transfer can be done over multiple ranges linked together (a descriptor sketch follows this list). This allows the OS to split a transfer along 4k page boundaries and removes the need to allocate linear memory ranges. AFAIK all modern DMA controllers work this way.

- IOMMUs are the most aggressive mechanism and are designed for devices with very large bandwidth usage (i.e. GPUs streaming per-frame textures and per-frame vertex buffers and large numbers of commands). The video card driver directly sends untranslated memory addresses from your program, along with a process tag, to the GPU. This is a complex mechanism with all sorts of fun stuff like IOMMU page faults. I don't think it makes sense for devices that don't transfer gigabytes of data. I imagine that on ForwardCom, that sort of device would end up being some kind of bilinear-texture-mapping accelerator.
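
For reference, a chain-DMA descriptor is typically something like this (the field layout is illustrative, not any particular controller's register format):

    #include <cstdint>

    // Typical scatter-gather ("chain") DMA descriptor: each entry covers one
    // physically contiguous range and links to the next, so a single transfer
    // can be split along page boundaries.
    struct DmaDescriptor {
        uint64_t phys_addr;   // start of this contiguous range
        uint32_t length;      // number of bytes in this range
        uint32_t flags;       // e.g. a "last descriptor in chain" bit
        uint64_t next;        // physical address of the next descriptor, or 0
    };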
agner wrote: 2021-03-02, 8:01:23 But address translation after the cache may be a very good idea. I would like to hear what you think about the following model:
  • The CPU has no TLB and no page tables, only a small memory map with variable-size memory blocks.
  • We are using various methods for reducing memory fragmentation as discussed in the ForwardCom documentation.
  • The cache is using virtual addresses. Each CPU core or process has its own cache and a large virtual address space.
  • Virtual address translation is needed only in problematic cases when running many processes with highly unpredictable memory requirements, or software and OS that makes heavy use of memory mapped files and devices.
  • Virtual address translation is inserted between cache and main RAM. Virtual address translation is done on cache misses and prefetching.
  • The virtual address translator could use fixed-size pages and huge page tables like current systems, but it would be easier to use variable-size pages so that we avoid the page size compromise and the huge page tables.
  • A virtual address translator with variable size pages could use a binary search algorithm or a binary tree or some similar method for address translation. This will take a few more clock cycles than a traditional TLB, but still less than the access time to main RAM.
Do you think this will work? What would be the best way to implement the virtual address translator?
Yes, that's the second meaning of "shared virtual address model", which is the same as a "VIVT" (virtually indexed, virtually tagged) cache (see https://en.wikipedia.org/wiki/CPU_cache ... ranslation). It means that the L1 cache uses virtual addresses, but not necessarily the larger caches or memory downstream (which generally have multi-cycle access times anyway).

The downside of VIVT is that you absolutely can't have multiple virtual pages leading to the same physical memory, except if they are all in read-only mode. If you have multiple mappings and there's a memory write, you get inconsistent memory (it will look like different memory in L1 but the same memory once it gets flushed to L2). This situation does happen in real OSes (for instance, if you use a memory-mapped file in Linux for inter-process communication, AFAIK). There are some schemes to deal with this in academic papers (page coloring etc.), but I absolutely don't know how reliable they are, and I think it would be a good thing to look into.

AFAIK most CPUs implement a compromise between a VIVT and a physically-indexed cache (i.e. the data cache looks like a physically-indexed cache to software, but the internal organisation is designed for fast virtual indexing).

What I think Sebastian was talking about for memory protection is that you can have a situation where protected memory (that should be accessible by the OS but not the program) is present in the cache... the CPU has to figure out whether an access should fault independently of whether the data is present in the cache (though maybe that can be done with cache line flags).

Anyhow, VIVT is entirely sufficient as a memory mapping system by itself as far as I can tell (aside from memory protection and dealing with duplicate mappings). I'm not sure you need other systems to go with it (such as a small memory map on the CPU).

Maybe what you could do for variable-sized memory blocks is have two types of page table entries: normal pages with the full mapping, and "continued" pages where the mapping is the same as a nearby previous one but with some offset.
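
Something like this, with a purely hypothetical encoding:

    #include <cstdint>

    // Sketch of the two entry kinds: a normal entry holding a full translation,
    // and a "continued" entry that reuses the translation of an earlier entry,
    // shifted by the distance between the two pages.
    struct PageEntry {
        bool     continued;   // true: same mapping as 'origin', just offset
        uint32_t origin;      // index of the entry that holds the real mapping
        uint64_t phys_base;   // physical base address, valid only when !continued
    };

    uint64_t resolve(const PageEntry table[], uint32_t index, uint64_t page_size) {
        const PageEntry& e = table[index];
        if (!e.continued)
            return e.phys_base;
        // A continued page follows its origin's mapping linearly in physical memory.
        return table[e.origin].phys_base + (index - e.origin) * page_size;
    }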
agner wrote: 2021-03-02, 8:01:23 Communication between multiple threads in the same process may be a problem if they have separate caches.

As the trend in microprocessor technology goes towards an increasing number of CPU cores, we may be able to allocate one core to each of the critical processes or threads with its own cache so that task switches are rare. Most memory allocations take place when new processes or threads are started. An ill-behaved process that keeps re-allocating memory unpredictably will have inferior performance, but it will not disturb other processes.
I guess with "continued" memory pages you could gracefully degrade from large-memory block behavior to page-table-tlb behavior.
agner wrote: 2021-03-02, 8:01:23Hyperthreading (running two threads in one core) should not be used.
I think that one is an implementation-specific detail, and it really depends on the kind of program you want to run on your core.

Hyperthreading is indeed pointless for single-threaded programs that are easy to schedule and make good use of the CPU resources. It's really more for cases where you know you'll have tons of threads (servers), where each thread's workload can't be vectorized (servers) and tends to have lots of difficult-to-predict memory accesses and jumps (servers). This is why the UltraSPARC was aggressively hyperthreaded but otherwise simple (lots of threads doing jumpy integer code), while, inversely, something like the Pentium 3 wasn't (software was single-threaded, could use SIMD, and the core was aggressively out-of-order for its time, so it could fill stalls with useful work).