1. Design The goal of vma mirroring is to allow for creating special file mappings where a given set of physical pages (those backing the file) is visible at two different linear addresses in a given task. Furthermore the mirroring logic ensures that the two mappings in linear address space will see the same physical pages even after they go through a swap-out/swap-in cycle or copy-on-write. While vma mirroring is a generic idea, PaX uses it for very specific purposes and therefore the implementation is a bit less generic than it could be (but it results in simpler and less intrusive changes). The first use is for mirroring executable regions into the code segment under SEGMEXEC. In this case the 3 GB userland linear address space is divided into two halves of 1.5 GB each and the code/data segment descriptors are modified to cover only one or the other. To be able to execute code under this setup we have to ensure that executable mappings are visible in the code segment region (1.5-3 GB range in linear address space). Since such executable mappings may contain data as well (constant strings, function pointer tables, etc), we have to have a mirror of these mappings at the same logical addresses in the data segment as well (0-1.5 GB range in linear address space). The nice property of this setup is that a pair of mirrored regions will have a constant difference between their start/end addresses: 1.5 GB (or SEGMEXEC_TASK_SIZE as it is often referenced in the code). The second use of vma mirroring is to implement the mirror of the main executable at a randomized address under RANDEXEC. Here again we will have a constant (task specific) difference between the mirrored regions and can simplify the implementation the same way as under SEGMEXEC. There is also an implicit third situation, when both SEGMEXEC and RANDEXEC are active for a task. At a first look this may appear very complex since the executable region of the main executable would have to be mirrored at three places instead of one: randomized mappings in both the data and the code segment (at the same logical addresses) plus a mirror into the code segment at the original logical address. Luckily, we can save on two of them: the randomized mapping in the data segment is not needed as we do not expect code to reference data in its executable segment in a position independent manner (which is what would be needed for code to learn its own location), second we do not need the original mapping mirrored in the code segment as we explicitly do not want it to be executable (so that code references to this region would raise a page fault and the RANDEXEC logic could then react on it). 2. Implementation vma mirroring requires two basic changes to the VM in Linux. First we have to provide an interface for setting up the mirrors, second we have to maintain the synchronicity between the mirrored regions' linear/physical mappings. Linux maintains a per task database of what is present in the given task's address space. This database is a set of structures called vm_area_struct each of which describes a single mapping (defined in include/linux/mm.h). The database for a task can be viewed in /proc//maps. For our purposes the relevance of the vma database is that it directly guides the page fault resolution logic which in turn is responsible for setting up the linear to physical address translation on a per page basis. To understand how all this works, consider task creation and its first moment of life in userland. For ELF executables the load_elf_binary() function in fs/binfmt_elf.c is responsible for populating the task's address space with a few basic mappings such as the stack, the dynamic linker (if the application in question is dynamically linked) and the main executable itself. The file mappings are established by using the kernel's internal do_mmap() interface through a simple wrapper called elf_map(). Note that at this point only the stack region has physical pages assigned (since that is where arguments, the environment, etc go and therefore must be present in physical memory at this early stage), the file mappings are not yet backed by physical memory pages. When the task begins its life in userland, the very first instruction fetch in ld.so or the main executable will raise a page fault since the Linux VM system does not establish a valid physical mapping until it is actually needed (i.e., it is demand based). The first thing the architecture specific page fault handler (for i386 it is do_page_fault() in arch/i386/mm/fault.c) does is to find the vma structure that describes the region in which the page fault happened then call the architecture independent handler (handle_mm_fault() in mm/memory.c) which based on the fault and the vma type will call the appropriate function to establish a physical page containing the expected data (in our case it would be read from the file backing the mapping, that is, somewhere from the .text section in an ELF file). The interface for setting up a vma mirror is a simple extension to the already existing memory mapping interface. This interface is accessible from userland as the mmap() library call. Since the vma mirroring facility is meant to be used by specific PaX features only, userland initiated vma mirroring requests are not allowed (that is why PaX returns an error from do_mmap2() in arch/i386/kernel/sys_i386.c). Care must be taken however for handling mmap() requests from tasks running under SEGMEXEC. This is because they can create executable file mappings and therefore they must be mirrored just like when the kernel itself establishes the initial file mappings as discussed above. Since all mmap() requests go through do_mmap() (an inline function defined in include/linux/mm.h) this is where PaX requests the extra mirrored mappings for SEGMEXEC executables. Since do_mmap2() originally gets around do_mmap() by directly calling do_mmap_pgoff(), we modified it to use do_mmap() instead. This way we can ensure that the SEGMEXEC logic gets to see both userland and kernel originated file mapping requests. vma mirror requests use special arguments for calling do_mmap_pgoff() in the end: 'file' must be NULL because the mirror will reference the same file as the vma to be mirrored 'addr' has its normal meaning of specifying a hint for searching a suitable hole in the address space where the mapping can go, 'len' must be 0 because it will be derived from the vma that is about to get mirrored, 'prot' has its normal meaning, 'flags' has its normal meaning except that it must also specify the new MAP_MIRROR flag and it must request a private mapping, 'pgoff' specifies the linear start address of the vma to be mirrored (note that here it is measured in bytes and not PAGE_SIZE units). The vma to be mirrored must exist at the specified start address ('pgoff') and must not be mirrored or be a mirror itself already. Furthermore PaX will not allow a writable mirror for a read-only vma. Note that these are only sanity checks to detect early if there is a bug in the rest of the vma mirroring logic (denied mirror requests will end up in a non-functioning task and are therefore easy to see for an end user). The second basic change needed for implementing vma mirroring is in the MMU state management logic (which governs the linear/physical translation). Our goal is simple: whenever the state of a mirrored page changes we will have to propagate the change into the state of the mirror page as well (and do all this atomically, that is, other state changing code must be locked out until we finished). Such state changes occur in the following operations: page fault servicing, munmap, mremap, mprotect, mlock and vma merging. Servicing a fault means that the kernel finds out why a page fault occured and when it is valid (it occured in a region described by a vma having proper access rights) it will allocate storage in physical memory and set up a valid linear/physical translation in the MMU (on i386 it means setting up a present pte). While the page fault classification (valid/invalid) is done in architecture specific code, the actual service needs no longer to care about such details and is architecture independent: handle_mm_fault(), therefore this is what we have to modify. Also note that handle_mm_fault() is used by other code as well that we would otherwise have to explicitly modify (get_user_page() for example that is used by ptrace() among others). The strategy for servicing a page fault in a mirrored vma is the following: first we do some sanity checks on the mirror's vma (again in order to detect potential bugs in the implementation early) then allocate the necessary MMU resources (various levels of paging structures) so that by the time we get to propagate the MMU state information, we will not have to worry about resource allocation failures. After successful resource allocation we let the original fault handling logic carry out its work (swap-in a page, do copy-on-write, etc) and intervene when it has just established the new MMU state for the mirror: we call the core of the vma mirroring logic in pax_mirror_fault() in mm/memory.c. To simplify the logic of the mirroring code, we established a simple naming convention for variables related to one or the other vma: the vma for which handle_mm_fault() was called is said to be mirrored and the corresponding variables have no suffixes, whereas the other vma is called the mirror and its variables are suffixed by _m. For example, vma_m is the vm_area_struct pointer of the mirror vma. pax_mirror_fault() first determines if it has anything to do in the first place and if so, it looks up the mirror vma and associated information, such as the mirror of the fault address and the related MMU structures (we are interested in the page table entry as it contains the physical page number that will have to be synchronized between the mirrors). Once the mirror's pte is known, we have to see if it currently specifies a valid mapping and if so, we have to invalidate it (and while handling the different cases, we also take care of the resident set size: we have to increment it if the mapping was not valid since it will be after mirroring). Invalidating the current mapping in the mirror is derived from kernel code doing the same: the munmap() and swap-out operations. The next and final step is to actually propagate the new linear/physical mapping into the mirror: we look up the physical page in the mirrored pte (and increment its use count since we are going to create another reference to it in the mirror's pte) then construct the mirror's pte from it and the appropriate access rights (the writability state must be copied verbatim from the mirrored pte otherwise we would ruin the copy-on-write logic). The atomicity of all the above actions is ensured by holding the appropriate page_table_lock on entry and never releasing it inside. This way the higher level callers (who establish the mirrored pte) ensure that the mirror's pte is established at the same time. The last set of vma mirroring related changes ensures that userland can modify/destroy mirrored regions only along with the corresponding mirrors (creation was described in the do_mmap() changes). The most complex change is in the munmap() logic which is responsible for destroying all kinds of mappings. To understand the changes let's first look at how it works in the standard kernel. The core function is do_munmap() in mm/memory.c which begins by doing some checks on the area to be removed then proceeds with moving all the vma structures that fall in there (fully or in part) from the mm's vma list to a special one. In the next phase this list is processed for clearing all linear/physical mappings in the MMU for each vma. The final step is to free up page tables that may have become empty and are no longer needed. Mirror handling requires two changes in the above logic. First, we have to detect if any to be removed vma is mirrored and if so we have to move its mirror vma to another special list (during the same atomic operation as it is done originally). Second, we have to clear the corresponding mappings in the MMU for this list of vma structures as well. While setting up the second special list is straightforward, the second step is not as the original kernel code is rather badly organized and does not lend itself for easy reuse. In order to avoid unnecessary code replication we opted for rearranging the original code a bit through simple program transformations: the MMU cleanup logic has been split into unmap_vma_list() and unmap_vma(). This way processing the second special list can be done in unmap_vma_mirror_list() which makes use of unmap_vma(). There is one last trick worth noting: map_count handling. This counter has to be decremented for each vma which gets unmapped. The original do_munmap() logic delegates this task to unmap_vma(). The problem with it is that at this point all the vma structures have already been removed from the main mm vma list however this counter is decremented one by one for each vma. This in turn will trigger a kernel BUG message because of the inconsistency between map_count and the actual number of vma structures on the mm vma list. Since this is a kernel BUG (or 'feature', after all at the end of do_munmap() everything will be in synch again), we decided to modify the original kernel logic so that map_count gets decremented during the special lists preparation phase. The next change is in do_mremap() in mm/mremap.c where we ensure that mirrored regions simply cannot be remapped (they can shrink however as it simply means a call to do_munmap() which handles mirrors fine). The remaining two userland interfaces are handled the same way because both mprotect() and mlock() have the same internal logic: enumerate all mappings in the given range and then act on each one of them individually. The functions we modify are mprotect_fixup() and mlock_fixup(), respectively. First the original functions are moved to __mprotect_fixup() and __mlock_fixup() then they are called for each vma in a mirror when one is encountered. The vma merging mechanism is governed by the inline can_vma_merge() function in include/linux/mm.h. PaX modifies this function to prevent anonymous mirror mappings from getting inadvertantly merged with others (file mappings are never merged). 3. Examples To help better understand vma mirroring, we present a few address space layouts and explain what happened there. In each case we used a copy of /bin/cat in /tmp to execute "/tmp/cat /proc/self/maps". Note that for the sake of simplicity we disabled RANDMMAP, this of course should not be done in production systems. The [x] marks are not part of the original output, we use them to refer to the various lines in the explanation. Active PaX features: SEGMEXEC and MPROTECT [1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat [3] 0804b000-0804d000 RW-p 00000000 00:00 0 [4] 20000000-20015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so [5] 20015000-20016000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so [6] 2001e000-20143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so [7] 20143000-20149000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so [8] 20149000-2014d000 RW-p 00000000 00:00 0 [9] 5fffe000-60000000 RW-p fffff000 00:00 0 [10] 68048000-6804a000 R-Xp 00000000 00:0b 1109 /tmp/cat [11] 80000000-80015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so [12] 8001e000-80143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so Since cat is a dynamically linked executable, its address space will have several file mappings besides the main executable. Let's see what each line represents. [1] is the first PT_LOAD segment of the /tmp/cat ELF file, it is mapped with R-X rights, that is, it contains the executable code plus all read-only initialized data as well. It is also mirrored by [10] because it is executable. [2] is the second PT_LOAD segment of the /tmp/cat ELF file, it is mapped with RW- rights, that is, it contains writable data (all initialized and the beginning of the uninitialized data, in our case all of it as they fit into a single page). [3] is the brk() managed heap (its size changes at runtime as cat calls malloc()/free()/etc). Note that if cat had more uninitialized data than what would fit into the gap left on the last page of mapping [2] then the rest would be mapped here from the beginning of [3] and the brk() managed heap would follow then. [4] and [5] are the PT_LOAD segments of the dynamic linker, whereas [6] and [7] are those of the C library. [4] and [6] are also mirrored by [11] and [12] respectively as they are executable. [8] is an anonymous mapping corresponding to uninitialized data in the C library (if we take a look at the ELF program headers of libc, we will see that the memory size of the second PT_LOAD segment is 4 pages more than its file size). [9] is another anonymous mapping containing the stack. We can observe that it is at the end of the userland address space (which under SEGMEXEC is at TASK_SIZE/2) and grows downwards. [10], [11] and [12] are the mirrors of the executable file mappings [1], [4] and [6] respectively (notice that each pair has exactly TASK_SIZE/2 "distance"). They are all above the TASK_SIZE/2 limit as well which means that they are part of the code segment and hence executable. Active PaX features: SEGMEXEC and RANDEXEC and MPROTECT [1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat 0804b000-0804d000 RW-p 00000000 00:00 0 [3] 20000000-20002000 ++-p 00000000 00:00 0 [4] 20002000-20003000 RW-p 00002000 00:0b 1109 /tmp/cat 20003000-20018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so 20018000-20019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so 20021000-20146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so 20146000-2014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so 2014c000-20150000 RW-p 00000000 00:00 0 [5] 5fffe000-60000000 RW-p 00000000 00:00 0 [6] 80000000-80002000 R-Xp 00000000 00:0b 1109 /tmp/cat 80003000-80018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so 80021000-80146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so Enabling RANDEXEC changes the layout slightly. In particular, beyond the previously shown mappings we can see [3] which represents the dummy anonymous mapping corresponding to the first (executable) PT_LOAD segment of cat and [4] which is the mirror of the second PT_LOAD segment of cat (notice that [2] and [4] have the same page offset values). Also observe that [1] is mirrored above the TASK_SIZE/2 limit by [6] at a different distance than TASK_SIZE/2, hence despite its having the R-X rights, it is not actually executable: logical addresses of [1] are invalid in the code segment, instead it is region [3] whose logical addresses are valid there (in region [6]). The careful reader has probably noticed a small difference between this and the previous situation: the stack area in the first case is 3 pages long whereas [5] here has only 2 pages. The reason for this discrepancy has to do with RANDUSTACK: the first part of the stack randomization cannot be disabled and in our case it happened to cause a big enough shift that made the kernel allocate an extra page for the initial stack. Active PaX features: PAGEEXEC and RANDEXEC and MPROTECT [1] 08048000-0804a000 R--p 00000000 00:0b 1109 /tmp/cat [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat 0804b000-0804d000 RW-p 00000000 00:00 0 [3] 40000000-40002000 R-Xp 00000000 00:0b 1109 /tmp/cat [4] 40002000-40003000 RW-p 00002000 00:0b 1109 /tmp/cat 40003000-40018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so 40018000-40019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so 40021000-40146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so 40146000-4014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so 4014c000-40150000 RW-p 00000000 00:00 0 bfffe000-c0000000 RW-p fffff000 00:00 0 The last case where vma mirroring takes place has the simplest layout of all as only the main executable is mirrored: [3] mirrors [1] and [4] mirrors [2]. Notice that [1] no longer has R-X rights but R-- as under PAGEEXEC it is the mapping rights that decide what is executable, not the mapping's position.