How to avoid memory fragmentation

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

kyle-forward
Posts: 3
Joined: 2021-12-11, 3:11:35

How to avoid memory fragmentation

Post by kyle-forward »

First, just had to say that I really am impressed with the amount of work that has gone into this whole project! Wow! I see a lot of really good and interesting ideas.

I read through the section in the manual on the various things that will help avoid memory fragmentation. Here is a list of the things that I understood and comments:

Stack - as noted, recursion makes this difficult if you cannot tell in advance how much stack you will need. Recursion is not as heavily used as it once was, so this may not be a huge problem, but it will make some algorithms more difficult. Many parsers and similar algorithms use recursion, so this seems like a possibly onerous restriction.
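One common workaround, if the call stack must stay bounded, is to move the recursion's state onto an explicit heap-allocated stack. A minimal Python sketch, purely illustrative and not ForwardCom code:

```python
# Illustrative: a recursive tree walk rewritten with an explicit
# heap-allocated worklist, so call-stack depth stays constant
# regardless of input size.

def sum_tree_recursive(node):
    # node is (value, [children]); call-stack depth grows with tree depth
    value, children = node
    return value + sum(sum_tree_recursive(c) for c in children)

def sum_tree_iterative(node):
    # Same result, but the worklist lives on the heap, not the call stack
    total = 0
    stack = [node]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)
    return total
```

The iterative version produces the same result, but its worklist grows on the heap, where the allocator can deal with it, instead of on a fixed-size call stack.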

DLLs etc - as long as it is clear that the CPU/ISA is a major break, I can follow this, though the details are not clear to me. How are things like ASLR going to be done? That is a relatively common scheme used to help defend against certain kinds of attacks. Is PIC the answer to that? As long as code is relatively static in size, this works. What about JIT systems like the JVM, where code is dynamically created?

Heap - this is where I run into problems. I do not follow the logic that you can come up with useful memory bounds for individual programs in the general case. Consider compilers, web browsers, and database engines. These all have enormously different memory footprints depending on their inputs. Compiling "hello, World" takes a very different amount of memory than compiling the LLVM code base.

If you have very different heap sizes for different programs and they start and stop independently, then you will fragment memory. Existing paging systems keep almost all pages the same size so that freed holes are interchangeable and easy to reuse. I followed some of the discussions, years ago, about the addition of huge pages to Linux for DB support, and one of the pain points was making sure that you had sufficient contiguous physical address space to allocate a huge page. It is a lot easier to do when the system boots and nothing has trampled all over RAM yet.
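A toy first-fit allocator makes the failure mode concrete: with variable-size blocks, total free space can be sufficient while no single hole is large enough. All sizes below are invented for illustration:

```python
# Toy first-fit allocator over a flat address range, showing how
# variable-size allocations fragment memory.

def first_fit(holes, size):
    """Allocate `size` from a sorted list of (start, length) holes.
    Returns the start address or None; mutates `holes`."""
    for i, (start, length) in enumerate(holes):
        if length >= size:
            if length == size:
                holes.pop(i)
            else:
                holes[i] = (start + size, length - size)
            return start
    return None

holes = [(0, 100)]                     # one 100-unit region
a = first_fit(holes, 30)               # occupies [0, 30)
b = first_fit(holes, 40)               # occupies [30, 70)
c = first_fit(holes, 30)               # occupies [70, 100)
# free a and c (non-adjacent), leaving two 30-unit holes
holes += [(0, 30), (70, 30)]
holes.sort()
# 60 units are free in total, but a 50-unit request fails:
print(first_fit(list(holes), 50))      # → None
```

Fixed-size pages sidestep this because every freed page can satisfy every future page request.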

Is this what the end of the discussion of the thread "Implications of ForwardCom memory management approach" started by user JoeDuarte is aiming at solving? Toward the end of that thread a system of virtual addresses (with VIVT caches) was proposed. That sounds workable, at least if I understood the details correctly. However, even that will have some level of fragmentation as long as the basic unit of physical memory allocation is not always the same.

As I mentioned in the other thread, the Mill design uses separated memory translation and memory protection. It also uses a single virtual address space for all processes.

In that same thread, it was mentioned that with VIVT caches, you do not want to have more than one virtual address map to the same physical address. But isn't that exactly what happens when you create a shared memory block between two or more processes?

This leaves me confused. It sounds like having just a few memory region mappings is undermined by fragmentation, and so is having variable-length pages/segments. VIVT caches seem like they might allow moving memory translation out of the fast path in the top-level cache(s). How much of a problem is the issue I noted above?

Best,
Kyle
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: How to avoid memory fragmentation

Post by agner »

Thank you for your interesting comments.

kyle-forward wrote:
Many parsers and similar algorithms use recursion
True. This even includes the parser in the ForwardCom assembler. But the recursion level is not very deep.
How are things like ASLR going to be done?
The security problems that "Address Space Layout Randomization" is aiming at do not exist in ForwardCom; they are avoided by the fundamental design: call stack and data stack are separate, data sections cannot be executed, executable sections cannot be read, and pointer tables are stored in read-only memory.
What about JIT systems like the JVM?
The just-in-time compiler and the code it has compiled may be treated as two separate processes. The JIT compiler is only run once, while the code it has compiled may run multiple times. A code interpreter for script languages is more critical because its memory requirement is difficult to estimate in advance. Script interpreters are slow anyway. So slow, in fact, that memory fragmentation is only a small contribution to the total execution time. If you want top performance, don't use script languages.
If you have very different heap sizes for different programs and they start and stop independently, then you will fragment memory.
This is true, and this is the weak point. You can have a very efficient ForwardCom processor for single-use applications such as number crunching, while a use with many running processes with unpredictable memory requirements will be less efficient regardless of which CPU design you use.
As I mentioned in the other thread, the Mill design uses separated memory translation and memory protection.
I removed that (very short) post because the discussion fits here. I don't see how the separation of memory translation and memory protection can mitigate memory fragmentation.
it was mentioned that with VIVT caches, you do not want to have more than one virtual address map to the same physical address. But isn't that exactly what happens when you create a shared memory block between two or more processes?
If a process stays in the same physical memory spot throughout its lifetime, then you don't need virtual memory translation at all because everything is position-independent in ForwardCom. In a highly crowded execution environment you may occasionally need to move a process to a different physical address and turn on virtual address translation, with a possible performance penalty. CPUs and RAM are relatively cheap nowadays. If you want top performance, then you would probably have the most critical applications each running on their own private computer.

There are different kinds of shared memory blocks. Code sections and read-only data sections are only shared between multiple processes in ForwardCom if they are instances of the same program. Library code is not shared between different running programs in ForwardCom because we have no DLLs. Writable data sections are shared through pointers which are assigned at run time. Such a section will have the same virtual address for all the processes that have access to it.
kyle-forward
Posts: 3
Joined: 2021-12-11, 3:11:35

Re: How to avoid memory fragmentation

Post by kyle-forward »

What about JIT systems like the JVM?
The just-in-time compiler and the code it has compiled may be treated as two separate processes. The JIT compiler is only run once, while the code it has compiled may run multiple times. A code interpreter for script languages is more critical because its memory requirement is difficult to estimate in advance. Script interpreters are slow anyway. So slow, in fact, that memory fragmentation is only a small contribution to the total execution time. If you want top performance, don't use script languages.
I am not sure why you conclude that JIT compilers run only once. That has not been the case for many years. Most high-performance JITs for both Java and JavaScript have at least three operating states, per method:
  • direct interpretation of bytecode (or equivalent).
  • warm methods get converted into quickly generated, but low performance, native code.
  • hot methods get converted into slowly generated, but high performance, native code.
Depending on runtime measurements, a given method can move between the three states. Some JIT systems skip the first state entirely.
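The tiered model above can be sketched as a per-method state machine driven by an invocation counter. The thresholds below are invented for illustration; real JITs such as HotSpot or V8 use far more elaborate heuristics:

```python
# Sketch of tiered JIT execution: per-method state driven by a
# call counter. Threshold values are hypothetical.

WARM_THRESHOLD = 10      # hypothetical: switch to the fast baseline compiler
HOT_THRESHOLD = 1000     # hypothetical: recompile with the optimizing compiler

class Method:
    def __init__(self, name):
        self.name = name
        self.state = "interpreted"   # direct interpretation of bytecode
        self.calls = 0

    def invoke(self):
        self.calls += 1
        if self.state == "interpreted" and self.calls >= WARM_THRESHOLD:
            self.state = "baseline"      # quickly generated, low-performance code
        elif self.state == "baseline" and self.calls >= HOT_THRESHOLD:
            self.state = "optimized"     # slowly generated, high-performance code

m = Method("parse")
for _ in range(1000):
    m.invoke()
print(m.state)   # → optimized
```

Each tier replaces the previous native code for the method, which is exactly why JIT code generation keeps happening throughout the program's lifetime.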

Browsers are probably the worst of all worlds because the user's clicks drive what code (web page) is loaded and how it eventually gets optimized. To make matters even more "fun", the JavaScript code can, and often does, modify the DOM which can lead to large explosions of heap usage that depends completely on the user's interactions with the web program.

I think that data bases are similar. You do not control the types of queries provided by the user and different queries and different data in tables can completely change the heap behavior.
As I mentioned in the other thread, the Mill design uses separated memory translation and memory protection.
I removed that (very short) post because the discussion fits here. I don't see how the separation of memory translation and memory protection can mitigate memory fragmentation.
The quote was in support of some of the other ideas that pointed out that you can separate the two. Whether it actually ends up being faster/easier/more efficient remains to be seen since there is no actual Mill hardware yet! That said, by separating protection and translation, you may free up bits (i.e. protection bits in a typical page table entry) that could be used for other purposes or to enable smaller entries which increases fetch efficiency and TLB efficiency (smaller data means more entries per unit of transistors/area).
it was mentioned that with VIVT caches, you do not want to have more than one virtual address map to the same physical address. But isn't that exactly what happens when you create a shared memory block between two or more processes?
If a process stays in the same physical memory spot throughout its lifetime, then you don't need virtual memory translation at all because everything is position-independent in ForwardCom. In a highly crowded execution environment you may occasionally need to move a process to a different physical address and turn on virtual address translation, with a possible performance penalty. CPUs and RAM are relatively cheap nowadays. If you want top performance, then you would probably have the most critical applications each running on their own private computer.

There are different kinds of shared memory blocks. Code sections and read-only data sections are only shared between multiple processes in ForwardCom if they are instances of the same program. Library code is not shared between different running programs in ForwardCom because we have no DLLs. Writable data sections are shared through pointers which are assigned at run time. Such a section will have the same virtual address for all the processes that have access to it.
I think the last paragraph answered my question above with a "yes." I.e. shared memory would be two (or more) different virtual addresses pointing to the same physical location.

Some observations:
  • the largest element allocated on the heap is almost always much, much smaller than the heap as a whole. This is not always true for things like image or video processing, but even there you tend to have things like layers rather than giant image chunks. The real core of the problem here is to make sure that large objects can be efficiently accessed and allocated, not that all heap is necessarily contiguous.
  • many modern filesystems use extents to allocate physical disk space to files. These are similar to the ranges used in Forward. One thing that filesystems also do is make things more complicated (multiple levels of extents and i-nodes) as files get bigger. Is there something from these ideas that could be used for Forward?
  • GC can be very, very fast. While GCing a few GB of physical memory takes a while, it could be a last resort for compacting fragmented memory. Of course, you would still need to fix up internal data offsets for programs.
Generally when there is something painful, you find a way to not do it or to make it less painful. A lot of what Forward is trying to do is the first option. Is there something that can be done to make address translation less painful? One of the things I like about the way that filesystems map file offsets to actual disk extents/blocks is that smaller files are fully mapped in a single inode. You only pay the cost of extra indirection when files get larger. Is there some analog to that for memory access in Forward?
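The inode analogy could be sketched like this: a mapping with a few direct extents plus one indirect level, so small objects translate in a single lookup and only large objects pay an extra hop. The entry counts and sizes below are hypothetical, not anything from the ForwardCom manual:

```python
# Sketch of an inode-style mapping: small objects are covered
# entirely by a few "direct" extents; larger objects spill into one
# indirect table and pay one extra lookup. All constants are invented.

DIRECT_EXTENTS = 4          # hypothetical on-chip entries per object
EXTENT_SIZE = 64 * 1024     # hypothetical fixed extent size (64 KB)

class Mapping:
    def __init__(self, extents):
        # first DIRECT_EXTENTS physical bases are "direct";
        # the rest go through one indirect table
        self.direct = extents[:DIRECT_EXTENTS]
        self.indirect = extents[DIRECT_EXTENTS:] or None

    def translate(self, offset):
        """Virtual offset -> (physical address, lookups needed)."""
        idx, within = divmod(offset, EXTENT_SIZE)
        if idx < len(self.direct):
            return self.direct[idx] + within, 1                   # single lookup
        return self.indirect[idx - len(self.direct)] + within, 2  # extra hop
```

As with filesystems, the common case (small mappings) stays cheap, and only the rare large mapping pays for indirection.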

Best,
Kyle
agner
Site Admin
Posts: 177
Joined: 2017-10-15, 8:07:27

Re: How to avoid memory fragmentation

Post by agner »

The more fragmented the memory becomes, the more complicated the access becomes:
  • In simple cases where there are few processes running and plenty of vacant RAM, you don't need any address translation. All you need is a small memory map with a few variable-size entries. This memory map can stay on-chip with hardware comparators on each entry so that you can check an address in a single clock cycle. This is the preferred situation when performance is critical.
  • Virtual address translation is needed if a memory block has been moved while the process is running. The virtual address translator is a hardware adder, probably adding an extra delay of one clock cycle.
  • With a low level of fragmentation, you can fix it with more entries in the memory map. A memory map with a limited number of entries can still fit on the chip with hardware comparators and adders on each entry. This map works like a content-addressable memory.
  • A high-performance program may start a new thread when it needs to allocate a large block of memory. Each thread has its own memory map. The memory allocated by a child thread is private to this thread by default so that it doesn't add to the memory maps of mother and sibling threads, while the memory of the main thread is accessible to its children. Thread-private memory is a ForwardCom feature which cannot be used in legacy software that assumes no memory protection between threads.
  • In the - hopefully rare - case where the memory becomes fragmented beyond the capacity of the on-chip memory map, you may start a garbage collector and reorganize the memory blocks to make them contiguous again. This causes a delay in program execution at unpredictable times. The operating system may prioritize the processes or threads so that high priority threads are fragmented less and thus less likely to be delayed by heap reorganization.
  • Current systems do little or nothing to avoid memory fragmentation. A highly fragmented memory makes it necessary to have large page tables with fixed-size entries. Intel processors now have page tables with five levels of indirection (https://en.wikipedia.org/wiki/Intel_5-level_paging). This makes page table walks quite long. The huge page tables do not fit on the chip, so you need a TLB for caching, and you get a penalty for TLB misses. The large TLB takes up silicon space, increases power consumption, and limits the clock frequency. This is the situation I want to avoid.
Of course, it is possible to add support for traditional page tables to a ForwardCom processor, but my idea is to investigate possibilities for avoiding this. The advantage of avoiding page tables and a TLB would be substantial. This will certainly work in small systems dedicated to a single purpose. Whether it will work in large multi-purpose systems remains to be seen.
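A software model of the on-chip memory map from the first three points might look like the following. The entry count, addresses, and permission bits are illustrative only; in hardware, all entries would be compared in parallel, like a CAM, rather than in a loop:

```python
# Software model of a small on-chip memory map: a handful of
# variable-size entries, each with base/limit comparators and an
# address-translation offset (0 when the block has never moved).
# All values are illustrative.

class MapEntry:
    def __init__(self, base, limit, offset=0, writable=True):
        self.base, self.limit = base, limit   # virtual range [base, limit)
        self.offset = offset                  # added if the block was moved
        self.writable = writable

def access(memory_map, addr, write=False):
    # Hardware would check all entries in parallel; we just loop.
    for e in memory_map:
        if e.base <= addr < e.limit:
            if write and not e.writable:
                raise PermissionError("write to read-only block")
            return addr + e.offset            # physical address
    raise MemoryError("access violation")

mm = [MapEntry(0x1000, 0x2000),                   # never moved: offset 0
      MapEntry(0x8000, 0x9000, offset=0x4000)]    # moved: translate by +0x4000
print(hex(access(mm, 0x8010)))                    # → 0xc010
```

With few enough entries, the whole structure fits on the chip, and the common case (offset 0) needs no translation at all.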

The application programmer can do a lot to optimize performance, when needed. This includes using a compiled programming language, avoiding abstract programming tools with many layers, and recycling allocated memory.
kyle-forward
Posts: 3
Joined: 2021-12-11, 3:11:35

Re: How to avoid memory fragmentation

Post by kyle-forward »

Thanks for the points.

I think the idea that was raised about using VIVT caches hides most of the cost of memory translation as long as there is no extra lookup. The first three points would be covered by that case. That allows for more flexibility. Certainly, most current CPUs are not hugely impacted by translation overhead as long as the translation data is local on the chip.

Having process-like memory boundaries between threads would be a problem for some execution models such as that of the JVM.

The same JVM systems show that very large memories can be garbage collected without significant performance penalties. In the case of ForwardCom, what is needed is not GC but memory reordering and compaction.

Another approach would be to intentionally sacrifice some memory to provide buffers between process memory segments. This is used in malloc implementations, for instance, to keep fragmentation low (or at least lower). It works pretty well until memory is heavily used.
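A related malloc technique is rounding requests up to a small set of size classes: some memory is wasted as slack, but freed blocks become interchangeable, which limits fragmentation. The bucket sizes below are invented for illustration:

```python
# Illustrative size-class rounding, as used by many malloc
# implementations: trade slack space for reusable, uniform blocks.

SIZE_CLASSES = [16, 32, 64, 128, 256, 512, 1024]

def size_class(request):
    """Round a request up to the nearest size class."""
    for c in SIZE_CLASSES:
        if request <= c:
            return c
    raise ValueError("large request: allocate separately")

print(size_class(40))    # → 64 (24 bytes of slack traded for easy reuse)
```

Any freed 64-byte block can then satisfy any future request up to 64 bytes, so holes of that class never go unusable.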

To restate the solutions:
  • programs that have known, fixed, memory usage require little effort.
  • programs that have some variability in memory usage can be handled with some remapping.
  • programs with yet more variability and fragmentation can be handled with some larger number of remapping entries.
  • use of malloc-like techniques may trade memory for fewer remapping entries.
  • finally, memory movement and compaction may be a "last resort" if there is significant fragmentation.
I left off the process/thread ideas as those require changes to programs, not the hardware or OS.

Best,
Kyle
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: How to avoid memory fragmentation

Post by HubertLamontagne »

Looking at the memory layout of various programs using Microsoft's VMMap tool ( https://docs.microsoft.com/en-us/sysint ... oads/vmmap ), Windows does manage to keep some fairly large blocks of memory contiguous, often larger than 10 MB. Very few memory blocks are just a single 4 KB page; typical blocks are more in the 100 KB to 900 KB range. So I guess the memory allocator isn't that bad at creating contiguity, or at least enough contiguity to help the memory prefetcher.