XNU IPC - OOL data and virtual memory
This is the third article in the series covering my learnings and experiments with the interprocess communication mechanisms in XNU. It is a direct follow-up to the introduction of out-of-line data in Mach messages; this time, we’ll look at the properties of the transferred memory.
Previous parts of the series are:
Unless you’re already familiar with these concepts, I’d recommend reading the previous articles first.
Virtual memory and OOL data transfer
There are effectively three OOL descriptor options that impact the data transfer: `copy`, `deallocate`, and `size`. We’re going to look at the virtual memory properties when transferring data using different combinations of those options.
I’ve prepared several test cases that we will discuss in detail. This time, the focus is on behaviour analysis, so I won’t dive into the implementation details. The test cases also don’t require any new Mach IPC concepts; they build on topics covered in the previous articles.
We’ll use proprietary Darwin userspace APIs to inspect the virtual memory properties. Covering how these APIs work isn’t relevant, so we won’t dive into that either. What’s most important are the few properties of virtual memory and OOL descriptor options:
- Virtual memory properties:
  - `share_mode`: the way memory is shared. The default is `SM_PRIVATE` for process-local memory; it can also be `SM_COW` to model copy-on-write (CoW) VM entries.
  - `pages_resident`: the number of resident pages - pages effectively occupying physical memory.
  - `ref_count`: the number of references to the VM map entry.
- The `copy` field in an OOL descriptor received by the client program.
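For reference, these properties can be queried from userspace with `mach_vm_region` and the `VM_REGION_EXTENDED_INFO` flavor. Here’s a minimal sketch of such a helper - my own illustration, not necessarily what the test programs use:

```c
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdio.h>
#include <unistd.h>

// Print share_mode / pages_resident / ref_count for the region containing addr.
static void print_region_info(mach_vm_address_t addr) {
    mach_vm_address_t region_addr = addr;
    mach_vm_size_t region_size = 0;
    vm_region_extended_info_data_t info;
    mach_msg_type_number_t count = VM_REGION_EXTENDED_INFO_COUNT;
    mach_port_t object_name = MACH_PORT_NULL;

    kern_return_t kr = mach_vm_region(mach_task_self(), &region_addr, &region_size,
                                      VM_REGION_EXTENDED_INFO,
                                      (vm_region_info_t)&info, &count, &object_name);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "mach_vm_region failed: %s\n", mach_error_string(kr));
        return;
    }
    // share_mode is one of the SM_* constants from <mach/vm_region.h>,
    // e.g. SM_PRIVATE or SM_COW.
    printf("[pid %d] share_mode=%d pages_resident=%u ref_count=%u\n",
           getpid(), info.share_mode, info.pages_resident, info.ref_count);
}
```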
VM properties
Before we dive into the analysis of the OOL data transfer, let’s have a look at some example states of virtual memory properties, to have a reference point.
Besides the proprietary APIs, we can also use the `vmmap` CLI tool to inspect some of the virtual memory properties.
Here’s an example output of the tool on a test program:
[vmmap output of the test program]
There are a few things here relevant for our further analysis:
- `RSDNT` is the total size of resident pages.
- The `__TEXT`/`__DATA` regions are mapped from a file, so they use the copy-on-write share mode `SM=COW`. Whenever there’s an attempt to modify these regions, they’re first copied to a new physical memory region.
- The `VM_ALLOCATE`/`Stack` regions use the private share mode `SM=private`. These regions are local to the process, so they don’t require copy-on-write.
I want to cover one more scenario before diving into OOL descriptor behaviour: process forking. Whenever you `fork` a process, the operating system copies the virtual memory space from the parent to the child process and marks the memory regions in both as copy-on-write. This limits the memory footprint, as memory is only copied when either of the processes writes to it.
I prepared a small test program to inspect the lifecycle and properties of a memory region when a process forks. The steps are as follows:
- Allocate a memory region using `vm_map`.
- Write some data to the allocated region.
- Fork.
- Write some data again to the same region in both processes. The parent will sleep for a second before performing this step.
- Put both processes to sleep for a while, which guarantees that the child doesn’t terminate before the parent can inspect the VM properties.
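Here’s a rough sketch of these steps, reusing the `print_region_info` helper from earlier and using `mach_vm_allocate` for brevity instead of `vm_map`; the actual test program may differ:

```c
// Continues the earlier sketch: same includes, plus <string.h>, and the
// print_region_info() helper defined above.
int main(void) {
    mach_vm_address_t addr = 0;
    if (mach_vm_allocate(mach_task_self(), &addr, vm_page_size,
                         VM_FLAGS_ANYWHERE) != KERN_SUCCESS) {
        return 1;
    }
    print_region_info(addr);                      // freshly allocated, no resident pages

    memset((void *)addr, 0xAA, vm_page_size);     // first write makes the page resident
    print_region_info(addr);

    pid_t child = fork();
    if (child == 0) {
        print_region_info(addr);                  // right after fork: shared, copy-on-write
        memset((void *)addr, 0xBB, vm_page_size); // child writes first, triggering CoW
        print_region_info(addr);
        sleep(10);                                // stay alive until the parent is done
        return 0;
    }

    sleep(1);                                     // let the child write before the parent
    print_region_info(addr);                      // parent's view after the child's copy
    memset((void *)addr, 0xCC, vm_page_size);     // now the parent writes as well
    print_region_info(addr);
    sleep(10);                                    // keep both processes around
    return 0;
}
```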
Making the processes sleep for a bit makes those actions sequential, so let’s have a look at the virtual memory properties from the perspective of both processes and analyze them along the way:
[VM properties logged by the parent and child processes]
- We just allocated a page. It’s not used yet, so it’s uninitialized.
- We wrote some data, so physical memory is now occupied, and that’s reflected in the number of resident pages.
- We used `fork`. The region is marked `SM_COW` because both processes share the same underlying physical memory, which is also why `ref_count == 2`.
- We’re back at `SM_PRIVATE`. Writing to this region triggered the copy-on-write, and a different physical memory region now backs it.
- The `ref_count` field has changed for the parent process, but the share mode is still copy-on-write. It’s likely an implementation detail: the region is still CoW, but if the `ref_count` is still `1` upon a write to the region, there is no need to make a copy.
- We’re back at `SM_PRIVATE`. This presumably didn’t require an actual physical memory copy, as the parent process was the only user of the region.
OOL VM memory behaviour
Now that we have some reference examples for the virtual memory properties, we can
start analyzing the OOL data behaviour.
So to recap, we will look at the Mach message `copy` option and the virtual memory properties from the client’s perspective.
Each test case we’re going to analyze has a different combination of the `copy`, `size` and `deallocate` options. It’s important to highlight that `size` affects only the allocated and transferred memory size: in each test case, the server program writes data only into the first memory page of the entire allocated region.
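For context, here’s a minimal sketch of how a test case might fill in the OOL descriptor with these options; the message layout and the `send_ool` helper are my own illustration rather than the actual test harness:

```c
#include <mach/mach.h>
#include <string.h>

// Illustrative message layout: header + body + a single OOL descriptor.
typedef struct {
    mach_msg_header_t         header;
    mach_msg_body_t           body;
    mach_msg_ool_descriptor_t data;
} ool_message_t;

// Send `size` bytes starting at `buffer` as OOL data with the given options.
static kern_return_t send_ool(mach_port_t remote_port, void *buffer,
                              mach_msg_size_t size,
                              mach_msg_copy_options_t copy,  // MACH_MSG_VIRTUAL_COPY or MACH_MSG_PHYSICAL_COPY
                              boolean_t deallocate) {
    ool_message_t msg;
    memset(&msg, 0, sizeof(msg));

    msg.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0) |
                           MACH_MSGH_BITS_COMPLEX;
    msg.header.msgh_size = sizeof(msg);
    msg.header.msgh_remote_port = remote_port;

    msg.body.msgh_descriptor_count = 1;

    msg.data.address    = buffer;
    msg.data.size       = size;
    msg.data.copy       = copy;        // the option under test
    msg.data.deallocate = deallocate;  // TRUE removes the region from the sender's map
    msg.data.type       = MACH_MSG_OOL_DESCRIPTOR;

    return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                    MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}
```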
A good exercise might be to stop for a minute after going over the test cases and think about the outcomes you’d expect. Spoiler alert - several were surprising to me.
The test cases are:
Nr | Name | copy | size | deallocate |
---|---|---|---|---|
1 | VIRTUAL;PAGE;NO_FREE | MACH_MSG_VIRTUAL_COPY | vm_page_size | no |
2 | VIRTUAL;PAGE;FREE | MACH_MSG_VIRTUAL_COPY | vm_page_size | yes |
3 | VIRTUAL;PAGEx16;NO_FREE | MACH_MSG_VIRTUAL_COPY | vm_page_size * 16 | no |
4 | VIRTUAL;PAGEx16;FREE | MACH_MSG_VIRTUAL_COPY | vm_page_size * 16 | yes |
5 | PHYSICAL;PAGE;NO_FREE | MACH_MSG_PHYSICAL_COPY | vm_page_size | no |
6 | PHYSICAL;PAGE;FREE | MACH_MSG_PHYSICAL_COPY | vm_page_size | yes |
7 | PHYSICAL;PAGEx16;NO_FREE | MACH_MSG_PHYSICAL_COPY | vm_page_size * 16 | no |
8 | PHYSICAL;PAGEx16;FREE | MACH_MSG_PHYSICAL_COPY | vm_page_size * 16 | yes |
And here are example results for each, from the client’s perspective:
Nr | Name | copy | region start | region end | region size | pages_resident | ref_count | share_mode |
---|---|---|---|---|---|---|---|---|
1 | VIRTUAL;PAGE;NO_FREE | VIRTUAL_COPY | 0x104814000 | 0x104818000 | 0x4000 | 1 | 1 | SM_PRIVATE |
2 | VIRTUAL;PAGE;FREE | VIRTUAL_COPY | 0x104814000 | 0x10481c000 | 0x8000 | 2 | 1 | SM_PRIVATE |
3 | VIRTUAL;PAGEx16;NO_FREE | VIRTUAL_COPY | 0x1049fc000 | 0x104a3c000 | 0x40000 | 1 | 1 | SM_PRIVATE |
4 | VIRTUAL;PAGEx16;FREE | VIRTUAL_COPY | 0x104a3c000 | 0x104a7c000 | 0x40000 | 1 | 1 | SM_PRIVATE |
5 | PHYSICAL;PAGE;NO_FREE | PHYSICAL_COPY | 0x104814000 | 0x104820000 | 0xc000 | 3 | 1 | SM_PRIVATE |
6 | PHYSICAL;PAGE;FREE | PHYSICAL_COPY | 0x104814000 | 0x104824000 | 0x10000 | 4 | 1 | SM_PRIVATE |
7 | PHYSICAL;PAGEx16;NO_FREE | PHYSICAL_COPY | 0x114a7c000 | 0x114abc000 | 0x40000 | 16 | 1 | SM_COW |
8 | PHYSICAL;PAGEx16;FREE | PHYSICAL_COPY | 0x114abc000 | 0x114afc000 | 0x40000 | 1 | 1 | SM_PRIVATE |
There are a few things here that were surprising to me, but let’s go over each of the tests.
1. We made a copy of the smallest data transfer unit for OOL data: a single page. So `SM_PRIVATE` and a single `ref_count` make sense. Based on the Mach messages documentation and what we’ve covered earlier, I expected the kernel to make a physical memory copy here, so seeing `MACH_MSG_VIRTUAL_COPY` in the `copy` option was surprising to me. If a physical copy was indeed performed, I thought this would be reflected on the client side. So either there was no physical memory copy, or the `copy` field on the receiver’s side only reflects the sender’s intent, not the actual action.
2. Because we used `deallocate`, there’s no need to copy the memory; it’s only a matter of moving the VM map entry. This case is as I expected. Note that the number of resident pages is `2`. That’s because the address selected by the kernel was adjacent to the previous region, and the two were merged into one, as you can see in the address range.
3. This is another interesting scenario. We’ve increased the number of pages, but the share mode is still private; increasing the number of pages didn’t affect it. I expected the share mode to be copy-on-write. Making a physical copy of 16 pages isn’t as cheap anymore, so I thought it’d behave the same as when forking a process.
4. It is again as expected, since we used `deallocate`. There’s not even a need for copy-on-write.
5. We’ve requested a physical copy, and we have a private memory region with a resident page, so nothing surprising here. As in case #2, the memory regions were merged.
6. This is the same scenario as #5.
7. I didn’t expect to see a copy-on-write share mode with `MACH_MSG_PHYSICAL_COPY`, but the `ref_count` is `1`, and all the pages of the region are resident. This could be just a side effect of the kernel performing the data copy.
8. The share mode makes sense. Again we used the `deallocate` option, so the virtual memory mapping just had to be moved. I wasn’t sure what to expect from the resident pages. On the one hand, this makes sense: why load the memory if there’s no need to copy the data? On the other hand, the behaviour is the same as in case #4, yet the copy option values end up different on the receiver’s side.
`copy` option in the receiver
Based on cases #1, #4 and #8, it seems reasonable to conclude that the `copy` OOL descriptor field on the receiver’s side reflects only the sender’s intent, not the actual memory operation performed by the kernel.
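To make that concrete, here’s a rough sketch - again my own illustration, not the repository’s client code - of how the receiver sees the `copy` field after a plain `mach_msg` receive:

```c
#include <mach/mach.h>
#include <stdio.h>

// Illustrative receive-side layout: the OOL descriptor plus the minimal trailer.
typedef struct {
    mach_msg_header_t         header;
    mach_msg_body_t           body;
    mach_msg_ool_descriptor_t data;
    mach_msg_trailer_t        trailer;
} ool_recv_message_t;

static void receive_ool(mach_port_t recv_port) {
    ool_recv_message_t msg;
    kern_return_t kr = mach_msg(&msg.header, MACH_RCV_MSG, 0, sizeof(msg),
                                recv_port, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    if (kr != KERN_SUCCESS || !(msg.header.msgh_bits & MACH_MSGH_BITS_COMPLEX)) {
        return;
    }
    // In the tests above, `copy` simply mirrored the value the sender put into
    // its descriptor, regardless of how the kernel actually moved the memory.
    printf("copy=%u address=%p size=%u\n",
           (unsigned)msg.data.copy, msg.data.address, msg.data.size);
}
```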
I had a look at the XNU sources to see if I could confirm this.
The `ipc_kmsg_copyout_ool_descriptor` function, responsible for copying the OOL descriptor and data from the kernel’s buffer to the receiver process, doesn’t modify the `copy` field.
Here’s also a code snippet of the `ipc_kmsg_copyin_ool_descriptor` function that copies the OOL descriptor from the sender to the kernel’s buffer:
[ipc_kmsg_copyin_ool_descriptor snippet from the XNU sources]
The comment explicitly states that `vm_map_copyin` may choose to do an optimized physical copy. The destination of the copy operation, `copy`, is saved into the `dsc` OOL descriptor. It doesn’t seem like `vm_map_copyin` even provides information on whether a virtual or physical memory copy occurred, which would explain why this isn’t reflected.
Resident pages in physical copy
In case #8, the number of resident pages wasn’t obvious up front, mainly because the `mach_msg` documentation for `MACH_MSG_VIRTUAL_COPY` has the following statement:
> Receivers concerned about deterministic access time should also exercise caution.
So I thought using `MACH_MSG_PHYSICAL_COPY` would force the kernel to allocate the necessary physical memory and thus guarantee deterministic access time by eliminating later page faults. But this isn’t the case, and it’s also explicitly documented in the previous code snippet:
[the relevant comment from the ipc_kmsg_copyin_ool_descriptor snippet]
The `mach_msg` documentation isn’t quite up to date, and it also references more advanced memory retrieval options, which are deprecated and unused. Maybe `MACH_MSG_PHYSICAL_COPY` used to consistently guarantee deterministic access time at some point, but that’s not the case now.
Where is copy-on-write?
The remaining unclear case is #3. We’ve requested a virtual copy of a significant amount of memory, but the memory region in the client process has a private share mode. Even the virtual memory region has a single `ref_count`. It seems like there are now two physical copies of the data.
To further analyze this scenario, I’ve added a similar test case with a larger memory size: 256MB. In this case, the server program writes data to all allocated pages, so that the full 256MB is reflected in the memory usage.
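A sketch of how the server might dirty the whole allocation - the `addr` and `total_size` names are placeholders of mine:

```c
#include <mach/mach.h>
#include <stdint.h>

// Touch one byte in every page so the whole region becomes resident (dirty).
static void touch_all_pages(mach_vm_address_t addr, mach_vm_size_t total_size) {
    volatile uint8_t *bytes = (volatile uint8_t *)addr;
    for (mach_vm_size_t offset = 0; offset < total_size; offset += vm_page_size) {
        bytes[offset] = 0xAB;
    }
}
```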
Here again, the virtual memory properties were the same. So I checked the output of the `vmmap` CLI tool again, specifically for those regions:
[vmmap output for the server and client regions]
For both server and client, the regions have the same properties.
`vmmap` not only reports the virtual memory mapping, but it also shows the total physical footprint, and there’s a difference:
[vmmap physical footprint summary for the server and client]
Those measurements don’t seem to match the virtual memory properties, as both programs have a region with 256MB of dirty and resident memory. The very low memory usage of the client process doesn’t exactly add up.
Next, I’ve tried the same scenario, but with the `deallocate` option. The peak memory usage of the server was again around 256MB, and its current usage had dropped. The client’s memory usage didn’t change even though it still had access to the 256MB of mapped data.
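With the hypothetical `send_ool` helper sketched earlier, this variant would correspond to something like the following, where `client_port`, `addr` and `total_size` are placeholders:

```c
// Hand the whole region to the client and drop it from the server's VM map.
send_ool(client_port, (void *)addr, (mach_msg_size_t)total_size,
         MACH_MSG_VIRTUAL_COPY, TRUE /* deallocate */);
```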
I’ve also further increased the amount of data to 2GB to make the memory usage easier to track, and checked the total memory usage reported by the Activity Monitor app. The Memory Used metric confirmed what seemed to be the case: after the data exchange, the total memory usage increased by ~4GB. However, Activity Monitor also reported a very low memory usage for the client process.
Large memory copy
The final copy-on-write test was to check what happens when the client starts writing data to the newly received memory. I’ve prepared a separate example for this test scenario - `ool_virtual_copy_large`.
The server program sends two messages with out-of-line data, ~2GB each.
The client program modifies all pages of only one of those memory regions.
In this scenario, the numbers reported by `vmmap` are:
[vmmap output for the large copy test]
The client writes to only one of the two regions, and finally, we can see
its memory usage has increased.
The total memory usage reported by Activity Monitor jumped from ~25GB to ~33GB, indicating that, again, each of the programs is effectively using around 4GB of memory. The client writing data to the region didn’t affect the overall memory usage, so it looks like the per-process memory usage tracking isn’t precise.
I suspect this could be related to how the kernel transfers the OOL data between processes. Mach messages are a buffered type of IPC: messages aren’t exchanged directly between programs but are buffered in the kernel.
That’s why the `ipc_kmsg_copyin_ool_descriptor` routine also has the companion `ipc_kmsg_copyout_ool_descriptor` routine. There’s an intermediate step where the virtual memory mapping information resides only in kernel memory.
After writing data to the retrieved region, the reported memory usage
of the client process increased, but the total memory usage didn’t. It seems like
the way the kernel transferred memory to the client process didn’t immediately
affect the client’s reported memory usage.
However, I don’t yet have quite enough understanding of the kernel internals to confirm
that the Mach message buffering is causing this memory reporting behaviour.
To summarize:
- In the virtual memory properties, each of the processes:
  - Has its own region with `ref_count = 1`.
  - The regions have a private share mode.
- The reported total memory usage has increased by 8GB.
Thus, it seems like both processes have their own physical memory copy, and copy-on-write isn’t used, which is quite surprising. I expected CoW to kick in and be visible similar to the forking scenario.
Summary
We looked at the different options for transferring out-of-line data over Mach messages and how these options affect virtual memory. Many of the scenarios were as expected or only slightly different. The biggest surprise was that I didn’t notice any use of the copy-on-write share mode with the transferred memory regions. Combined with the overall memory usage increases, it seems like the client ended up with its own physical copies of the data.
Not leveraging CoW when exchanging large chunks of data would be rather wasteful of memory resources. Maybe another CoW mechanism exists, and it’s implemented in a way where `SM_COW` isn’t reported? It’d be interesting to dive into the kernel implementation in the future and see if and when OOL data transfers could use CoW.
If you happen to know more about this, please reach out.
Full implementation of the test cases can be found at GitHub.
Thanks for reading! If you’ve got any questions, comments, or general feedback, you can find all my social links at the bottom of the page.