Understanding Memory Management – Where do I begin?

Well, not all of us have had the opportunity to sit in a Computer Architecture class and learn about memory and related things, have we? But if you’re interested in knowing how memory works and the software built around it, here’s something I found very helpful two months ago, when I was a complete newbie. It is a long document covering the essential fundamentals of memory, written by Ulrich Drepper and broken down into 9 parts on LWN.net. It is also available in a paper format, for those interested.

What every programmer should know about memory – Part 1

There are links to the other parts in the article.

And here’s the same content in paper format

About time!

If you are observing how fast your system is, you’ll likely come across units of time like nanoseconds and milliseconds, which you might have trouble understanding. Of course, you know that these are very small values, but how exactly do you compare them? I came across an old video of Grace Murray Hopper explaining these things using wires.

You also might have heard very often that disks are really slow compared to main memory, which is itself slow compared to the CPU caches, and of course, also about other components of the memory hierarchy. But how do we understand how slow these things are when latencies are spoken about in nanoseconds and milliseconds, terms people mostly don’t use in everyday life?

I chanced upon this figure in Brendan Gregg’s Systems Performance: Enterprise and the Cloud:

[Figure: table from Systems Performance scaling example system latencies to human-relatable units of time]

This table explains how slow things are in terms of units of time we understand, so comparison is relatively easy.

Page Allocation in the Linux Kernel – Part 2: First Allocation Attempt

This is a follow-up to part 1 on the events in __alloc_pages_nodemask().

In the first allocation attempt, __alloc_pages_nodemask() calls get_page_from_freelist().

Function header:

static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
		struct zone *preferred_zone, int migratetype)

Parameters passed from the calling function:

  1. gfp_mask | __GFP_HARDWALL : The hardwall bit is set to enforce the cpuset memory allocation policy (cpusets restrict processes to specific processors and memory nodes; detailed information can be found here).
  2. nodemask
  3. order
  4. zonelist
  5. high_zoneidx (of type enum zone_type)
  6. ALLOC_WMARK_LOW | ALLOC_CPUSET for alloc_flags : ALLOC_WMARK_LOW makes the watermark checks use the low watermark, and ALLOC_CPUSET asks for the cpuset to be honoured
  7. preferred_zone
  8. migratetype
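
Putting these together, the call made from __alloc_pages_nodemask() looks roughly like this (a condensed sketch based on kernel sources of that era, not a line-for-line copy):

page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx,
			ALLOC_WMARK_LOW | ALLOC_CPUSET,
			preferred_zone, migratetype);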

Here’s where each zone in the zonelist is scanned for free pages.

The int variable zlc_active is set to 1 when cached zone data is used to skip over zones that are almost full. It is initialized to 0 before the loop begins.

For each iteration, these are the steps executed.

Step 1:

if (NUMA_BUILD && zlc_active &&
		!zlc_zone_worth_trying(zonelist, z, allowednodes))
	continue;

What this means is: if NUMA is configured in the system (NUMA_BUILD is 1 when NUMA is configured), zlc_active is checked, and then a call to zlc_zone_worth_trying() decides whether the zone should be considered further for free memory. This check is discussed in detail in the comments here. If the condition is not satisfied, the next zone in the zonelist is considered, skipping over the current one.

Step 2:

if ((alloc_flags & ALLOC_CPUSET) &&
		!cpuset_zone_allowed_softwall(zone, gfp_mask))
	goto try_next_zone;

ALLOC_CPUSET asks the allocation to check for the correct cpuset, and the call to cpuset_zone_allowed_softwall() checks whether the current zone can be used for the allocation.

If the condition is false, the next zone is tried; but first, on NUMA-aware systems with more than one node, the zonelist cache is set up if that hasn’t been done yet. This is where zlc_active is set to 1, as the sketch below shows.
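
For reference, this is roughly what the try_next_zone label did in kernels that still had the zonelist cache (a condensed sketch, with did_zlc_setup being the flag that records whether the cache has already been set up):

try_next_zone:
	if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
		/* Set up the zonelist cache only after the first zone is
		 * skipped, and only when multiple nodes make it worthwhile. */
		allowednodes = zlc_setup(zonelist, alloc_flags);
		zlc_active = 1;
		did_zlc_setup = 1;
	}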

Step 3:

If watermarks have to be checked for this allocation (known by checking that ALLOC_NO_WATERMARKS is not set in alloc_flags), zone_watermark_ok() is first called to see whether the zone’s watermarks are satisfied for the allocation. If they are, the current zone is tried for the allocation.

If they aren’t satisfied, zone reclaim is attempted, provided zone_reclaim_mode is set; if it is not set, the next zone is tried after the same zonelist cache setup as in Step 2. Zone reclaim basically reclaims unmapped pages: pages which are not mapped into any process address space and don’t appear in any process’s page table. The code for zone reclaim can be read here. Zone reclaim can take time; therefore, if the allocation cannot be delayed, zone reclaim is skipped and an appropriate flag (ZONE_RECLAIM_NOSCAN) is returned. This is not the only case where the flag is returned. Do read the code to know more about the other cases in detail.

If zone reclaim did not scan, the next zone is tried. If it scanned but could not reclaim pages (ZONE_RECLAIM_FULL was returned), the zone is marked full on NUMA-aware systems and the next zone is tried.

Step 3 is skipped if watermarks don’t have to be checked.
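
Condensed, the whole of Step 3 looks roughly like this (again a sketch based on kernel sources that had zone_reclaim() and the zonelist cache; classzone_idx is the index of the preferred zone):

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
	unsigned long mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];

	if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags))
		goto try_this_zone;		/* watermarks are fine */

	if (zone_reclaim_mode == 0)
		goto this_zone_full;		/* reclaim disabled: move on */

	switch (zone_reclaim(zone, gfp_mask, order)) {
	case ZONE_RECLAIM_NOSCAN:		/* did not scan */
		goto try_next_zone;
	case ZONE_RECLAIM_FULL:			/* scanned but nothing reclaimable */
		goto this_zone_full;
	default:				/* did we reclaim enough? */
		if (!zone_watermark_ok(zone, order, mark,
					classzone_idx, alloc_flags))
			goto this_zone_full;
	}
}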

Step 4: Page allocation is attempted in the current zone with a call to buffered_rmqueue(), which returns a pointer to a page structure; if NULL is returned, the allocation was unsuccessful. If the allocation was successful, the loop is broken out of and the page is returned.

If unsuccessful, the next zone is tried.
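
Step 4 and the “zone full” bookkeeping, condensed in the same sketchy fashion:

try_this_zone:
	page = buffered_rmqueue(preferred_zone, zone, order,
				gfp_mask, migratetype);
	if (page)
		break;			/* success: stop scanning zones */
this_zone_full:
	if (NUMA_BUILD)
		zlc_mark_zone_full(zonelist, z);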

If all the zones in the zonelist have been iterated over and the allocation is still unsuccessful, the zonelist is scanned again if the system is NUMA aware and zlc_active is set.

If this attempt returns NULL, it means that the allocation was not successful. In that case, the slowpath is tried. The next post in this series is dedicated to the slowpath.

Page Allocation in the Linux Kernel – Part 1: Bird’s eye view

Sorry about the boring caption 😛 This post gives an overview of memory allocation in the kernel (as of version 4.7).

All page allocations are handled by __alloc_pages_nodemask() in page_alloc.c. As you can see in the comment above, this function is very much the heart of the buddy allocator. Let’s go over what it does, step by step.

The parameters passed to it are

  1. gfp_t gfp_mask : This is essentially a bit mask that specifies how the allocation should be made (e.g., DMA, whether the allocator may call into the filesystem, atomic, etc. The entire range of possibilities can be seen here). As an aside, this talks a little more about these flags, for those interested. And gfp stands for get free pages, to demystify it a little.
  2. unsigned int order : This gives the order of the allocation. To recall, page allocation is done in powers of 2. In general, when allocation is done for 2^n pages, it is said to be an nth order allocation.
  3. struct zonelist *zonelist : Kernel documentation FTW. Read the comment for struct zonelist here to understand why it is used.
  4. nodemask_t *nodemask : This contains node information for the current allocation.
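
For context, most kernel code doesn’t call __alloc_pages_nodemask() directly; it gets there through wrappers like alloc_pages(). A minimal sketch of such a caller (example_alloc() is a made-up name):

#include <linux/gfp.h>

/* Sketch: a zero-order (single page) allocation through the usual
 * alloc_pages() wrapper, which ends up in __alloc_pages_nodemask(). */
static int example_alloc(void)
{
	struct page *page = alloc_pages(GFP_KERNEL, 0);

	if (!page)
		return -ENOMEM;
	/* ... use the page ... */
	__free_pages(page, 0);
	return 0;
}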

First, gfp_mask is used to find the zone type of the allocation. The zone types are given by the enum zone_type, which is already beautifully documented here. The mask is also used to find the corresponding migrate type (line 2116), which I will talk about later.

After a check to see if the allocation is allowed to succeed (should_fail_alloc_page() looks at the gfp_mask and the order) and another to see if the zonelist has at least one valid zone for the current allocation (the attempt is aborted if either check fails), the preferred zone is determined. This should help you understand. If, for some reason, the preferred zone is NULL, the attempt is aborted.

When all the above conditions are satisfied and all is well, the first allocation attempt happens: get_page_from_freelist() is called, and it tries to allocate pages going through the zonelist.

If the first attempt fails, the kernel enters what is called the slowpath for page allocation.

Both these attempts will be discussed in future posts.

Here’s a rough sketch to explain the above steps. The parts highlighted in red return on failure, i.e., the allocation attempt is aborted if they fail. S stands for success and F for failure (in those little arrows in the end).

[Figure: flowchart of the allocation steps described above]

The whatsit page

I realized that if I’m to write more about my work, I’ll have to throw around a lot of jargon, and that’s not much fun if you haven’t yet had a proper introduction to memory management terms, or if it’s been a while and you have trouble recollecting them. This post talks very briefly about some of the most frequently used terms in Linux memory management. I encourage you to read more about whatever topic you might be a stranger to. I intend to keep updating this post, so the current list is not complete by any means (I don’t think it can ever be).

Memory pages

These are the basic units of memory management in Linux. The sizes of pages may vary with architecture. 4 KB pages seem to be common in most 32-bit architectures. Larger page sizes (8 KB, 16 KB, etc) are more common in 64-bit architectures.

When configured to, Linux also supports huge pages, which are much larger pages.

The size of the huge pages on a system can be obtained using the command

$ cat /proc/meminfo | grep Hugepagesize

while

$ getconf PAGE_SIZE

should give the size of regular memory pages.
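
From within a program, the same value can be read with sysconf():

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* sysconf(_SC_PAGESIZE) returns the regular page size in bytes */
	printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
	return 0;
}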

Virtual memory 

This is what memory management folk fondly refer to as ‘The VM’. It is a technique used by operating systems wherein the memory addresses used by processes are translated from a virtual address space to a physical address space by the MMU.

Since the addresses used are virtual, a process can run under the impression that it has way more memory than physically available, as the system can keep only what is currently required in the physical memory and bring in the rest later.

Another important benefit is protection of the system. Each process has its own virtual address space and cannot affect another process’s address space unless some sort of sharing is established.
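
A small demonstration of this isolation: after fork(), the parent and child see the same virtual address for a variable, yet a write in one is invisible to the other, because each ends up with its own physical page (thanks to copy-on-write).

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int x = 42;

int main(void)
{
	if (fork() == 0) {
		x = 100;	/* the child writes to its own copy */
		printf("child:  &x = %p, x = %d\n", (void *)&x, x);
		return 0;
	}
	wait(NULL);
	/* same virtual address in both processes, but x is still 42 here */
	printf("parent: &x = %p, x = %d\n", (void *)&x, x);
	return 0;
}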

Demand Paging

Linux uses demand paging, a technique that allows the system to postpone the allocation of pages for as long as possible. For example, when a process is executed with only a portion of its executable file in RAM, a page fault occurs when it tries to access what has not yet been brought into memory. The required pages are loaded on demand from the disk, as the sketch below illustrates. Similarly, when huge files are accessed, only parts of them are kept in memory, and the remaining parts are loaded only when required.
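
Here is a user-space sketch of demand paging in action, assuming a file named bigfile exists: mmap() only sets up the mapping, and the first access to each page triggers a page fault that pulls the page in from disk.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("bigfile", O_RDONLY);	/* hypothetical file */

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	/* No file data is read here; only the mapping is set up. */
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* This first dereference faults, and the kernel reads the
	 * containing page in from disk on demand. */
	printf("first byte: %d\n", p[0]);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}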

Another kind of paging is called Anticipatory Paging, which is sort of the opposite of this. Here, pages are brought to memory well in advance in the hope of minimizing page faults.

Non Uniform Memory Access or NUMA

When there are multiple CPUs in a system, it is possible that some banks of memory are closer to some CPUs than the rest. In such cases, the memory closer to a CPU can be accessed much faster and significant delays may be encountered when memory is far away, especially in large systems. Such systems are called NUMA systems.

Uniform Memory Access or UMA

Here, all processors share the physical memory uniformly. There are no differences in the memory access times of different processors.

Node

Each bank of memory used in a system is called a node. In NUMA systems, there are multiple nodes. When memory allocations happen in NUMA systems, the node closest to the running CPU is considered for allocation of pages.
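
Inside the kernel, an allocation can also be directed at a particular node. A minimal sketch using alloc_pages_node() (example_local_alloc() is a made-up name):

#include <linux/gfp.h>
#include <linux/topology.h>

/* Sketch (kernel context): request one page from the node the
 * current CPU belongs to. */
static struct page *example_local_alloc(void)
{
	int nid = numa_node_id();	/* node of the running CPU */

	return alloc_pages_node(nid, GFP_KERNEL, 0);
}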

Zone

Each node is divided into zones. These zones are divided based on their uses. A quick look at the enum zone_type in include/linux/mmzone.h shows that there can be up to 6 zones depending on the configuration –

ZONE_DMA For devices that can only perform DMA to low physical addresses
ZONE_DMA32 Needed on x86_64 for devices that can only DMA to 32-bit addresses
ZONE_NORMAL Normal addressable memory (DMA possible)
ZONE_HIGHMEM Memory not permanently mapped into the kernel’s address space (relevant on 32-bit systems)
ZONE_MOVABLE Intended for movable pages; helps with memory hot-remove and avoiding fragmentation
ZONE_DEVICE Device memory (for example, persistent memory)

The kernel tries to allocate memory in the preferred zone. If no memory is available in that zone, fallback zones are considered in the order of priority.

Order

In the Linux kernel, memory pages are allocated in powers of 2. The integer to which 2 is raised is called the order. For example, when a single page is requested, it is considered a zero order allocation.

Buddy Allocation / Binary Buddy Allocation

The Linux kernel memory allocator is called the buddy allocator. Memory is divided into blocks of 2^n pages each, where n ranges from zero to MAX_ORDER (defined in the kernel), i.e., 1, 2, 4, 8 and so on are the possible block sizes. If a request for a certain block size cannot be fulfilled, a higher-order block is divided in half; these two halves are called buddies. This division continues until the desired block size is obtained and the allocation is made. When the allocated pages are finally freed, the buddies are coalesced if they are both free. For example, an order-1 (2-page) request served from a free order-3 (8-page) block splits it into 4+4, then one 4 into 2+2, handing out one 2-page block.
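
Here is a toy user-space illustration of that splitting (purely hypothetical bookkeeping, nothing like the real kernel code): one free order-3 block services an order-1 request, leaving an order-2 and an order-1 buddy free.

#include <stdio.h>

#define MAX_ORDER 11

/* free_count[o] = number of free blocks of 2^o pages (toy state) */
int alloc_order(int free_count[], int want)
{
	int o;

	/* find the smallest free block of order >= want */
	for (o = want; o < MAX_ORDER; o++)
		if (free_count[o] > 0)
			break;
	if (o == MAX_ORDER)
		return -1;		/* nothing big enough: the request fails */

	free_count[o]--;		/* take that block off its free list */
	while (o > want) {		/* split it down to the wanted order */
		o--;
		free_count[o]++;	/* one buddy of each split stays free */
	}
	return 0;			/* the final buddy is the allocation */
}

int main(void)
{
	int free_count[MAX_ORDER] = { 0 };

	free_count[3] = 1;		/* one free 8-page (order-3) block */
	alloc_order(free_count, 1);	/* request 2 pages (order 1) */

	/* after the splits: one order-2 and one order-1 buddy remain free */
	for (int o = 0; o < 4; o++)
		printf("order %d: %d free\n", o, free_count[o]);
	return 0;
}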

If you are interested in seeing how the blocks are currently organized, try the following command:

$ cat /proc/buddyinfo

It will give you the number of free blocks for all the orders starting from 0 in ascending order, for all the nodes and zones.

 

I’ll be back soon 🙂

Swapping, latency measurements and Ftrace – Part 2

Part 1 spoke about the concept of swapping. Now let’s roll up our sleeves and dive right into Ftrace.

First, we need to set up Ftrace. Follow the “Setting up Ftrace” section of this article and come back when you’re done.

Ftrace gives you a lot of options to get the information you want, filtering out the rest. Let’s first look at the different kinds of tracers available. In the tracing directory, the contents of the file available_tracers will display the tracers you can use.

/sys/kernel/debug/tracing# cat available_tracers

blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

We are primarily concerned with the function_graph tracer for analyzing swapping latencies. This tracer traces entries to and exits from functions and presents the results in a format that is easy to follow. Echoing the name of the tracer to the current_tracer file enables the tracer.

/sys/kernel/debug/tracing# echo function_graph > current_tracer

The file tracing_on is used to enable and disable recording of data. So if you want to make sure recording is enabled, use the following command:

/sys/kernel/debug/tracing# cat tracing_on

If the output is 1, the tracer is recording data. If it is 0, you can enable it by echoing 1 to the file for the tracer to start recording.

If you want to understand the flow of execution of the kernel, you can study the output of the entire trace. However, if there are specific code paths you’d like to observe, you can have a look at the contents of the file available_filter_functions and pick a function to observe latencies in that specific function and the functions it may call.

For example, when I wanted to see how long it took for the system to handle a page fault by swapping a page back in, I saw that the function __do_page_fault was present in the available_filter_functions file, and with a bit of kernel code reading, I knew that it was called every time there was a page fault. So I enabled it by echoing the name of the function to the set_graph_function file.

/sys/kernel/debug/tracing# echo __do_page_fault > set_graph_function

How do I read the output of the trace?

The file ‘trace’ in the tracing directory contains the output. You can view it using the following command:

/sys/kernel/debug/tracing# cat trace

The output of the trace can be overwhelming. So you can choose to view the top N lines of the file using this command:

/sys/kernel/debug/tracing# cat trace | head -n N

Or depending on your preference, you could store the entire trace output in a file and look at it at leisure, like I sometimes do.

/sys/kernel/debug/tracing# cat trace > ~/name-of-the-file

This stores the trace output in the file name-of-the-file in the home directory.

This is what a small part of my output for __do_page_fault looks like. From this it is quite clear that __do_page_fault() has called down_read_trylock() (which took 0.098 us to execute), _cond_resched() (0.087 us) and find_vma() (0.875 us), which in turn has called vmacache_find() (0.116 us).

 2)               |  __do_page_fault() {
 2)   0.098 us    |    down_read_trylock();
 2)   0.087 us    |    _cond_resched();
 2)               |    find_vma() {
 2)   0.116 us    |      vmacache_find();
 2)   0.875 us    |    }

Now if I go all the way down to the end of this function graph, I can see how long __do_page_fault() has taken to execute (48.02 us). I have omitted the rest of the results here as they are way too detailed and long.

48.02 us may not be a very high latency. Much higher latencies can be triggered when anonymous memory is swapped out. Anonymous memory is memory which is not backed by a file (for example, memory allocated by malloc()). Latencies on the order of milliseconds are observed in such cases.
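
For instance, the memory obtained below is anonymous: there is no file behind it, so swapping it out means writing it to the swap device, and the next access pays the price of a major fault.

#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t len = 64 * 1024 * 1024;	/* 64 MB, arbitrary */
	char *buf = malloc(len);	/* anonymous memory: no backing file */

	if (!buf)
		return 1;
	/* Touching the pages faults them in; under memory pressure they
	 * can later be swapped out, making the next access much slower. */
	memset(buf, 1, len);
	free(buf);
	return 0;
}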

If ( unlikely(you are not too fond of the terminal) ) or you are better off using a GUI when you have a choice, there’s KernelShark for you. This should serve as a good introduction.

If you are interested in knowing more about Ftrace, these are the places you want to check out:

  1. Secrets of the Ftrace function tracer
  2. Trace-cmd – A front end for Ftrace
  3. Measuring function duration with Ftrace
  4. Debugging the kernel using Ftrace – Part 1
  5. Debugging the kernel using Ftrace – Part 2
  6. And of course, the kernel documentation 🙂