This is the third article in the series covering my learnings and experiments with the interprocess communication mechanisms in XNU. It is a direct follow-up to the introduction of out-of-line data in Mach messages. Building on that introduction, we’ll look at the properties of the transferred memory.

Previous parts of the series are:

Unless you’re already familiar with these concepts, I’d recommend reading the previous articles first.

Virtual memory and OOL data transfer

There are effectively 3 OOL descriptor options that impact the data transfer - copy, deallocate, and size. We’re going to look at the virtual memory properties when transferring data using different combinations of those options.

I’ve prepared several test cases that we will discuss in detail. This time, the focus is on behaviour analysis, so I won’t dive into the implementation details. The test cases also didn’t require any new Mach IPC concepts, and they’re built based on topics covered in the previous articles.

We’ll use proprietary Darwin userspace APIs to inspect the virtual memory properties. How these APIs work isn’t relevant here, so we won’t dive into that either (although there’s a small query sketch right after the list, for reference). What’s most important are a few properties of virtual memory and OOL descriptor options:

  • Virtual memory properties:
    • share_mode: the way memory is shared - SM_PRIVATE is the default for process-local memory, while SM_COW marks copy-on-write (CoW) VM entries.
    • pages_resident: the number of resident pages - pages effectively occupying physical memory.
    • ref_count: number of references to the VM map entry.
  • The copy field in an OOL descriptor received by the client program.
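
For reference, here’s a minimal sketch of the kind of query that yields these properties - most likely mach_vm_region with the VM_REGION_EXTENDED_INFO flavor. The helper name and error handling are mine, not taken from the article’s test programs:

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdio.h>

// Prints share_mode, pages_resident and ref_count for the virtual memory
// region that contains `address` in the current task.
static void print_region_info(mach_vm_address_t address)
{
    mach_vm_address_t region_address = address;
    mach_vm_size_t region_size = 0;
    vm_region_extended_info_data_t info;
    mach_msg_type_number_t count = VM_REGION_EXTENDED_INFO_COUNT;
    mach_port_t object_name = MACH_PORT_NULL;

    kern_return_t kr = mach_vm_region(mach_task_self(),
                                      &region_address, &region_size,
                                      VM_REGION_EXTENDED_INFO,
                                      (vm_region_info_t)&info,
                                      &count, &object_name);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "mach_vm_region: %s\n", mach_error_string(kr));
        return;
    }

    printf("pages_resident: %u\n", info.pages_resident);
    printf("ref_count: %u\n", info.ref_count);
    printf("share_mode: %d\n", info.share_mode); // e.g. SM_PRIVATE, SM_COW
}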

VM properties

Before we dive into the analysis of the OOL data transfer, let’s have a look at some example states of virtual memory properties, to have a reference point.

Besides the proprietary APIs, we can also use the vmmap CLI tool to inspect some of the virtual memory properties. Here’s an example output of the tool on a test program:

$ vmmap ool_memory_transfer_types_server
...

==== Non-writable regions for process 14188
REGION TYPE                    START - END         [ VSIZE  RSDNT  DIRTY   SWAP] PRT/MAX SHRMOD PURGE    REGION DETAIL
__TEXT                      109919000-10991d000    [   16K    16K     0K     0K] r-x/r-x SM=COW          /Users/USER/*/ool_memory_transfer_types_server

==== Writable regions for process 14188
REGION TYPE                    START - END         [ VSIZE  RSDNT  DIRTY   SWAP] PRT/MAX SHRMOD PURGE    REGION DETAIL
__DATA                      109921000-109925000    [   16K    16K     4K     0K] rw-/rw- SM=COW          /Users/USER/*/ool_memory_transfer_types_server
VM_ALLOCATE                 109968000-109969000    [    4K     4K     4K     0K] rw-/rwx SM=PRV
Stack                    7ffee5ae7000-7ffee62e7000 [ 8192K    20K    20K     0K] rw-/rwx SM=PRV          thread 0

There are a few things here relevant for our further analysis:

  • RSDNT is the total size of resident pages.
  • __TEXT/__DATA regions are mapped from a file, so they use the copy-on-write share mode (SM=COW). Whenever there’s an attempt to modify these regions, the affected pages are first copied to a new physical memory region.
  • VM_ALLOCATE/Stack regions use the private share mode (SM=PRV). These regions are local to the process, so they don’t require copy-on-write.

I want to cover one more scenario before diving into OOL descriptor behaviour: process forking. When you fork a process, the operating system copies the virtual memory map from the parent to the child and marks the memory regions in both processes as copy-on-write. This limits the memory footprint, as memory is only physically copied when either process writes to it.

I prepared a small test program to inspect the lifecycle and properties of a memory region when a process forks (a sketch of it follows the list of steps). The steps are as follows:

  1. Allocate a memory region using vm_map.
  2. Write some data to the allocated region.
  3. Fork.
  4. Write some data again to the same region in both processes. The parent will sleep for a second before performing this step.
  5. Put both processes to sleep for a while. This guarantees that the child doesn’t terminate before the parent can inspect the VM properties.
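
The exact test program isn’t important, but here’s a minimal sketch of the steps above. It reuses the print_region_info helper sketched earlier and allocates with mach_vm_allocate for brevity; the original test allocates with vm_map, so treat this as an approximation rather than the article’s actual code:

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    mach_vm_address_t addr = 0;
    kern_return_t kr = mach_vm_allocate(mach_task_self(), &addr,
                                        vm_page_size, VM_FLAGS_ANYWHERE);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "mach_vm_allocate: %s\n", mach_error_string(kr));
        return 1;
    }
    void *region = (void *)(uintptr_t)addr;

    print_region_info(addr);                   // 1. right after allocation

    memset(region, 0xAA, vm_page_size);
    print_region_info(addr);                   // 2. after the first write

    if (fork() == 0) {                         // child process
        print_region_info(addr);               // 3. child's view after fork
        memset(region, 0xBB, vm_page_size);
        print_region_info(addr);               // 4. after the child writes again
        sleep(30);                             // keep the child alive
        return 0;
    }

    sleep(1);                                  // let the child write first
    print_region_info(addr);                   // 5. parent's view after the sleep
    memset(region, 0xCC, vm_page_size);
    print_region_info(addr);                   // 6. after the parent writes again
    sleep(30);
    return 0;
}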

Making the processes sleep for a bit keeps those actions sequential, so let’s have a look at the properties of virtual memory from the perspective of both processes and analyze them along the way:

1. Right after `vm_map`.

pages_resident: 0
ref_count: 0
share_mode: SM_EMPTY

2. After writing some data to the region.

pages_resident: 1
ref_count: 1
share_mode: SM_PRIVATE

3. After fork, from the child's perspective.

pages_resident: 1
ref_count: 2
share_mode: SM_COW

4. After the child writes some data again.

pages_resident: 1
ref_count: 1
share_mode: SM_PRIVATE

5. After sleep, from the parent's perspective.

pages_resident: 1
ref_count: 1
share_mode: SM_COW

6. After the parent writes some data again.

pages_resident: 1
ref_count: 1
share_mode: SM_PRIVATE

  1. We just allocated a page. It’s not used yet, so it’s uninitialized.
  2. We wrote some data, so physical memory is now occupied, which is reflected in the number of resident pages.
  3. We forked. The region is marked SM_COW because both processes share the same underlying physical memory, which is also why ref_count == 2.
  4. We’re back at SM_PRIVATE. Writing to this region triggered the copy-on-write, and different physical memory now backs it.
  5. The ref_count has dropped back to 1 for the parent process, but the share mode is still copy-on-write. This is likely an implementation detail: the region remains marked CoW, but if ref_count is still 1 when a write happens, there’s no need to actually make a copy.
  6. We’re back at SM_PRIVATE. This presumably didn’t require an actual physical memory copy, as the parent process was the only user of the region.

OOL VM memory behaviour

Now that we have some reference examples for the virtual memory properties, we can start analyzing the OOL data behaviour. To recap, we will look at the Mach message copy option and the virtual memory properties from the client’s perspective.
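
As a reminder of what the client side looks like, here’s a minimal sketch of receiving a message with a single OOL descriptor and inspecting both its copy field and the virtual memory properties of the received region. The message struct is my assumption about the layout used by the test programs, and print_region_info is the helper sketched earlier:

#include <mach/mach.h>
#include <stdint.h>
#include <stdio.h>

// Complex Mach message carrying a single out-of-line descriptor, plus the
// trailer appended by the kernel on receive.
typedef struct {
    mach_msg_header_t header;
    mach_msg_body_t body;
    mach_msg_ool_descriptor_t ool;
    mach_msg_trailer_t trailer;
} ool_recv_message_t;

static void receive_and_inspect(mach_port_t port)
{
    ool_recv_message_t msg = {0};
    kern_return_t kr = mach_msg(&msg.header, MACH_RCV_MSG, 0, sizeof(msg),
                                port, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "mach_msg: %s\n", mach_error_string(kr));
        return;
    }

    // The copy field as seen by the receiver, plus where the kernel mapped
    // the out-of-line data in our address space.
    printf("copy: %u, address: %p, size: %u\n",
           (unsigned int)msg.ool.copy, msg.ool.address, msg.ool.size);

    // share_mode / pages_resident / ref_count of the received region.
    print_region_info((mach_vm_address_t)(uintptr_t)msg.ool.address);
}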

Each test case we’re going to analyze has a different combination of the copy, size, and deallocate options. It’s important to highlight that size affects only the allocated and transferred memory size; in each test case, the server program writes data only into the first memory page of the allocated region.
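
On the sending side, those combinations boil down to how the OOL descriptor is filled in. Here’s a minimal sketch of the sender’s setup - the struct and helper names are mine, and the port setup is assumed to already exist:

#include <mach/mach.h>

// Complex Mach message with a single out-of-line descriptor.
typedef struct {
    mach_msg_header_t header;
    mach_msg_body_t body;
    mach_msg_ool_descriptor_t ool;
} ool_send_message_t;

// Sends `size` bytes starting at `addr` to `remote_port`, using the given
// copy strategy (MACH_MSG_VIRTUAL_COPY or MACH_MSG_PHYSICAL_COPY) and
// deallocate flag - the three knobs varied across the test cases.
static kern_return_t send_ool(mach_port_t remote_port, void *addr,
                              mach_msg_size_t size,
                              mach_msg_copy_options_t copy,
                              boolean_t deallocate)
{
    ool_send_message_t msg = {0};

    msg.header.msgh_bits =
        MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0) | MACH_MSGH_BITS_COMPLEX;
    msg.header.msgh_remote_port = remote_port;
    msg.header.msgh_size = sizeof(msg);

    msg.body.msgh_descriptor_count = 1;

    msg.ool.type = MACH_MSG_OOL_DESCRIPTOR;
    msg.ool.address = addr;
    msg.ool.size = size;             // e.g. vm_page_size or vm_page_size * 16
    msg.ool.copy = copy;             // requested copy strategy
    msg.ool.deallocate = deallocate; // unmap the region from the sender?

    return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                    MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}

Test case #2, for example, would then correspond to a call like send_ool(port, addr, vm_page_size, MACH_MSG_VIRTUAL_COPY, TRUE).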

A good exercise might be to stop for a minute after going over the test cases and think about the outcomes you’d expect. Spoiler alert - several were surprising to me.

The test cases are:

Nr  Name                      copy                    size               deallocate
1   VIRTUAL;PAGE;NO_FREE      MACH_MSG_VIRTUAL_COPY   vm_page_size       no
2   VIRTUAL;PAGE;FREE         MACH_MSG_VIRTUAL_COPY   vm_page_size       yes
3   VIRTUAL;PAGEx16;NO_FREE   MACH_MSG_VIRTUAL_COPY   vm_page_size * 16  no
4   VIRTUAL;PAGEx16;FREE      MACH_MSG_VIRTUAL_COPY   vm_page_size * 16  yes
5   PHYSICAL;PAGE;NO_FREE     MACH_MSG_PHYSICAL_COPY  vm_page_size       no
6   PHYSICAL;PAGE;FREE        MACH_MSG_PHYSICAL_COPY  vm_page_size       yes
7   PHYSICAL;PAGEx16;NO_FREE  MACH_MSG_PHYSICAL_COPY  vm_page_size * 16  no
8   PHYSICAL;PAGEx16;FREE     MACH_MSG_PHYSICAL_COPY  vm_page_size * 16  yes

And here are example results for each from the client’s perspective:

Nr  Name                      copy           region start  region end   region size  pages_resident  ref_count  share_mode
1   VIRTUAL;PAGE;NO_FREE      VIRTUAL_COPY   0x104814000   0x104818000  0x4000       1               1          SM_PRIVATE
2   VIRTUAL;PAGE;FREE         VIRTUAL_COPY   0x104814000   0x10481c000  0x8000       2               1          SM_PRIVATE
3   VIRTUAL;PAGEx16;NO_FREE   VIRTUAL_COPY   0x1049fc000   0x104a3c000  0x40000      1               1          SM_PRIVATE
4   VIRTUAL;PAGEx16;FREE      VIRTUAL_COPY   0x104a3c000   0x104a7c000  0x40000      1               1          SM_PRIVATE
5   PHYSICAL;PAGE;NO_FREE     PHYSICAL_COPY  0x104814000   0x104820000  0xc000       3               1          SM_PRIVATE
6   PHYSICAL;PAGE;FREE        PHYSICAL_COPY  0x104814000   0x104824000  0x10000      4               1          SM_PRIVATE
7   PHYSICAL;PAGEx16;NO_FREE  PHYSICAL_COPY  0x114a7c000   0x114abc000  0x40000      16              1          SM_COW
8   PHYSICAL;PAGEx16;FREE     PHYSICAL_COPY  0x114abc000   0x114afc000  0x40000      1               1          SM_PRIVATE

There are a few things here that were surprising to me, but let’s go over each of the tests.

  1. We made a copy of the smallest data transfer unit for OOL data - a single page. So SM_PRIVATE and a single ref_count make sense. Based on the Mach messages documentation and what we’ve covered earlier, I expected the kernel to make a physical memory copy. However, the MACH_MSG_VIRTUAL_COPY copy option was surprising to me. If a physical copy was indeed performed, I thought this would be reflected on the client side. So either there was no physical memory copy, or the copy field on the receiver’s side only reflects the sender’s intent, not the actual action.

  2. Because we used deallocate, there’s no need to copy the memory. It’s only a matter of moving the VM mapping entry. This case is as I expected. Note that the number of resident pages is 2. That’s because the address selected by the kernel was adjacent to the previous region and they were merged into one, as you can see in the address range.

  3. This is another interesting scenario. We’ve increased the number of pages, but the share mode is still private; increasing the number of pages didn’t affect it. I expected the share mode to be copy-on-write - making a physical copy of 16 pages isn’t as cheap anymore, so I thought it’d behave the same as when forking a process.

  4. It is again as expected, as we used deallocate. There’s not even a need for copy-on-write.

  5. We’ve requested a physical copy, and we have a private memory region with a resident page. So nothing surprising here. Like in the #2 case, the memory regions were merged.

  6. This is the same scenario as #5.

  7. I didn’t expect to see a copy-on-write share mode with MACH_MSG_PHYSICAL_COPY, but the ref_count is 1, and all the pages of the region are resident. This could be just a side effect of how the kernel performs the data copy.

  8. The share mode makes sense. Again, we used the deallocate option, so the virtual memory mapping only had to be moved. I wasn’t sure what to expect from the resident pages. On the one hand, this makes sense: why load the memory if there’s no need to copy the data? On the other hand, the behaviour is the same as in case #4, yet the copy option values end up different on the receiver’s side.

copy option in the receiver

Based on cases #1, #4 and #8, it seems reasonable to conclude that the copy OOL descriptor field on the receiver’s side reflects only the sender’s intent, not the actual memory operation performed by the kernel. I had a look at the XNU sources to see if I could confirm this. The ipc_kmsg_copyout_ool_descriptor function, responsible for copying the OOL descriptor and data from the kernel’s buffer to the receiver process, doesn’t modify the copy field.
Here’s also a code snippet of the ipc_kmsg_copyin_ool_descriptor function that copies the OOL descriptor from the sender to the kernel’s buffer:

static mach_msg_descriptor_t *
ipc_kmsg_copyin_ool_descriptor(
	mach_msg_ool_descriptor_t *dsc,
	mach_msg_descriptor_t *user_dsc,
	...)
{
  ...
  /*
   * Make a vm_map_copy_t of the of the data. If the
   * data is small, this will do an optimized physical
   * copy. Otherwise, it will do a virtual copy.
   *
   * NOTE: A virtual copy is OK if the original is being
   * deallocted, even if a physical copy was requested.
   */
  kern_return_t kr = vm_map_copyin(map, addr,
      (vm_map_size_t)length, dealloc, copy);
  if (kr != KERN_SUCCESS) {
    *mr = (kr == KERN_RESOURCE_SHORTAGE) ?
        MACH_MSG_VM_KERNEL :
        MACH_SEND_INVALID_MEMORY;
    return NULL;
  }
  dsc->address = (void *)*copy;
}

The comment explicitly states that vm_map_copyin may choose to do an optimized physical copy. The resulting copy object - copy, a vm_map_copy_t - is saved into the dsc OOL descriptor. It doesn’t seem like vm_map_copyin even provides information on whether a virtual or physical memory copy occurred, which would explain why this isn’t reflected on the receiver’s side.

Resident pages in physical copy

In case #8 the number of resident pages wasn’t obvious up front, mainly because the mach_msg documentation for MACH_MSG_VIRTUAL_COPY has the following statement:

Receivers concerned about deterministic access time should also exercise caution.

So I thought using MACH_MSG_PHYSICAL_COPY would force the kernel to allocate the necessary physical memory and thus guarantee deterministic access time by eliminating later page faults. But this isn’t the case and is also explicitly documented in the previous code snippet:

* NOTE: A virtual copy is OK if the original is being
* deallocted, even if a physical copy was requested.

The mach_msg documentation isn’t quite up-to-date, and it also references more advanced memory retrieval options, which are deprecated and unused. Maybe MACH_MSG_PHYSICAL_COPY used to consistently guarantee deterministic access time at some point, but that’s not the case now.

Where is copy-on-write?

The remaining unclear case is #3. We’ve requested a virtual copy of a significant amount of memory, but the memory region in the client process has a private share mode, and it even has a ref_count of 1. It seems like there are now two physical copies of the data.

To further analyze this scenario, I’ve added a similar test case with a larger memory size - 256MB. In this case, the server program writes data to all allocated pages, so the whole region actually occupies physical memory.
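
Writing to all pages just means touching each one. A minimal sketch of that step, assuming addr and size describe the allocated region, could look like this:

#include <mach/mach.h>
#include <stdint.h>

// Writes one byte into every page of [addr, addr + size) so that each page
// becomes resident and dirty.
static void touch_all_pages(mach_vm_address_t addr, mach_vm_size_t size)
{
    volatile unsigned char *bytes = (volatile unsigned char *)(uintptr_t)addr;
    for (mach_vm_size_t offset = 0; offset < size; offset += vm_page_size) {
        bytes[offset] = 0xFF;
    }
}
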
Here again, the virtual memory properties were the same, so I checked the output of the vmmap CLI tool again, specifically for those regions:

$ vmmap ool_memory_transfer_types_server
...
REGION TYPE                    START - END         [ VSIZE  RSDNT  DIRTY   SWAP] PRT/MAX SHRMOD PURGE    REGION DETAIL
VM_ALLOCATE                 104f8c000-114f8c000    [256.0M 256.0M 256.0M     0K] rw-/rwx SM=PRV

$ vmmap ool_memory_transfer_types_client
...
REGION TYPE                    START - END         [ VSIZE  RSDNT  DIRTY   SWAP] PRT/MAX SHRMOD PURGE    REGION DETAIL
VM_ALLOCATE                 102318000-112318000    [256.0M 256.0M 256.0M     0K] rw-/rwx SM=PRV

For both server and client, the regions have the same properties.
vmmap not only reports the virtual memory mapping, but it also shows the total physical footprint, and there’s a difference:

$ vmmap ool_memory_transfer_types_server
...
Physical footprint:         257.2M
Physical footprint (peak):  257.2M

$ vmmap ool_memory_transfer_types_client
...
Physical footprint:         1041K
Physical footprint (peak):  1041K

Those measurements don’t seem to match the virtual memory properties, as both programs have a region with 256MB of dirty and resident memory. The very low memory usage of the client process doesn’t exactly add up.
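
As a side note, the physical footprint reported by vmmap can also be read programmatically, which makes it easier to cross-check such discrepancies. This isn’t something the article’s test programs do; the sketch below uses proc_pid_rusage:

#include <libproc.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

// Prints the physical footprint of a process in bytes - the same metric
// that vmmap reports as "Physical footprint".
static void print_phys_footprint(pid_t pid)
{
    struct rusage_info_v2 ri;
    if (proc_pid_rusage(pid, RUSAGE_INFO_V2, (rusage_info_t *)&ri) != 0) {
        perror("proc_pid_rusage");
        return;
    }
    printf("phys_footprint: %llu bytes\n",
           (unsigned long long)ri.ri_phys_footprint);
}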

Next, I’ve tried the same scenario, but with the deallocate option. The peak memory usage of the server was again around 256MB, and the current usage dropped. The client’s memory usage didn’t change even though it still had access to the 256MB of mapped data.

I’ve also increased the amount of data to 2GB to make the memory easier to track, and checked the total memory usage reported by the Activity Monitor app. The Memory Used metric confirmed what seemed to be the case - after the data exchange, the total memory usage increased by ~4GB. However, Activity Monitor still reported very low memory usage for the client process.

Large memory copy

The final copy-on-write test was to check what happens when the client starts writing data to the newly received memory. I’ve prepared a separate example for this test scenario - ool_virtual_copy_large.

The server program sends two messages with out-of-line data, ~2GB each. The client program modifies all pages of only one of those memory regions. In this scenario, the numbers reported by vmmap are:

$ vmmap ool_virtual_copy_large_server | grep -i physi
Physical footprint:         4.0G
Physical footprint (peak):  4.0G
$ vmmap ool_virtual_copy_large_client | grep -i physi
Physical footprint:         2.0G
Physical footprint (peak):  2.0G

The client writes to only one of the two regions, and finally, we can see that its memory usage has increased. The total memory usage reported by Activity Monitor jumped from ~25GB to ~33GB, indicating that, again, each of the programs is effectively using around 4GB of memory. The client’s writes to the region didn’t affect the overall memory usage, so it looks like the per-process memory usage tracking isn’t precise.

I suspect this could be related to how the kernel transfers the OOL data between processes. Mach messages are a buffered type of IPC: messages aren’t exchanged directly between programs but are buffered in the kernel. That’s why ipc_kmsg_copyin_ool_descriptor also has a companion ipc_kmsg_copyout_ool_descriptor routine - there’s an intermediate step where the virtual memory mapping information resides only in kernel memory. After writing data to the received region, the reported memory usage of the client process increased, but the total memory usage didn’t. It seems like the way the kernel transferred memory to the client process didn’t immediately affect the client’s reported memory usage. However, I don’t yet have enough understanding of the kernel internals to confirm that Mach message buffering is causing this memory reporting behaviour.

To summarize:

  • Looking at the virtual memory properties, each of the processes:
    • has its own region with ref_count = 1;
    • and those regions have a private share mode.
  • The reported total memory usage increased by 8GB.

Thus, it seems like both processes have their own physical memory copy, and copy-on-write isn’t used, which is quite surprising. I expected CoW to kick in and be visible, similarly to the forking scenario.

Summary

We looked at different options for transferring out-of-line data over Mach messages and how these options affect virtual memory. Many of the scenarios were as expected or only slightly different. The biggest surprise was that I didn’t notice any use of the copy-on-write share mode for the transferred memory regions. Given the overall memory usage increases, it seems like the client ended up with its own physical copies of the data.

Not leveraging CoW when exchanging large chunks of data would be rather wasteful of memory. Maybe another CoW mechanism exists, and it’s implemented in a way where SM_COW isn’t reported? It’d be interesting to dive into the kernel implementation in the future and see if and when OOL data transfers could use CoW. If you happen to know more about this, please reach out.

The full implementation of the test cases can be found on GitHub.


Thanks for reading! If you’ve got any questions, comments, or general feedback, you can find all my social links at the bottom of the page.