0. abstract this document discusses the possibility of implementing non-executable pages for IA-32 processors (i.e. pages which user mode code can read or write, but cannot execute code in). since the processor's native page table/directory entry format has no provision for such a feature, it is a non-trivial task. credit for the underlying idea goes to Kevin Lawton, Ramon van Handler and/or whoever else contributed it to the freemware/plex86 project (see footnote 1). 1. intro security of networked computers is an everyday topic of discussions these days. many of the security problems can be attributed to a special attack technique based on exploiting buffer overflows. this kind of attack relies on improperly processed data inside an application (see footnote 2). most of the time improper processing of data results in the execution of the attacker's code inside the application's context (i.e. as if the application itself had executed it). this in effect allows the attacker to perform any operation that the application itself would be able/allowed to. this is a very undesirable behaviour and should be remedied. the natural way of solving the problem is of course fixing the existing errors and making sure that new code is carefully designed/implemented (the perfect programmer case). since humans will always make mistakes, there is a desire to provide some sort of programmatic way for protecting against buffer overflow based attacks. one such idea is the implementation of non-executable pages which eliminates the possibility of executing code in pages which are supposed to hold data only (in a typical scenario a buffer overflow occurs in an array stored on the stack). on some microprocessors the paging logic allows for marking pages for execution only (vs. data read and/or write), however this feature is unfortunately missing from IA-32 and microprocessors based on it (Intel, Amd, etc). 2. the idea in 1999.07 the plex86 (then freemware) project conducted a test with the help of internet users. the purpose of this test was to determine support for a specific feature (some might consider it a bug, although it turns out to be a life-saver) in IA-32 compatible processors. this feature relies on the fact that pentium and newer processors have separate data and instruction translation lookaside buffers (DTLBs and ITLBs). TLBs act as a cache for page table entries (PTEs), hence store user/supervisor access rights among others. under normal circumstances the ITLB and the DTLB entries (for a given linear/physical address pair) get loaded from the same PTE hence they contain a consistent state. the idea that the plex86 project wanted to test is that in theory it is possible to write code which will cause an inconsistent state in the DTLB and ITLB entries. they were interested in having a page where only code execution is allowed, data read/write access would cause a page fault (PF) (and therefore allow detection of self-modifying code and the like, which is an important feature for their virtualization strategy). however, this very same mechanism would allow for creating another kind of inconsistent state where only data read/write accesses would be allowed and code execution prohibited. and this is what is needed for protecting against (many) buffer overflow based attacks. their tests proved successful (see footnote 3). 3. the theory PTE and TLB management is the task of the paging subsystem of the operating system. therefore implementing non-executable pages must be an integral part of the OS. in this section we will describe what such an implementation must take care of, and then attempt to implement it for Windows NT/2000 and Linux. since source code and/or deep system level documentation for the former is not publicly available, we will delve into some reverse engineering as well (hopefully Microsoft will incorporate this feature into an upcoming release of NT themselves). a non-executable page requires creating and maintaining a specific set of states in the TLBs. the state information represents the TLBs' knowledge of the given page and is summarized in the following table: ITLB - | S | U ----------- D - | 0 | 1 | 2 | T ----------- L S | 3 | 4 | 5 | B ----------- U | 6 | 7 | 8 | ----------- that is, either TLB has either no entry for a given linear/physical address pair (-), or has an entry which specifies either User (U) or Supervisor (S) access rights. note that under normal circumstances state 5 and 7 never occur (they are the inconsistent states). the initial state for any page is state 0 (i.e. neither TLB has an entry, since there was no access to the given page). as the processor accesses the page (instruction fetch or data read/write access), the state begins to change in a well defined manner. one half of the possible state transitions is summarized below: 0 -> 1 (ITLB fill, PTE specifies Supervisor rights) 0 -> 2 (ITLB fill, PTE specifies User rights) 3 -> 4 (ITLB fill, PTE specifies Supervisor rights) 3 -> 5 (ITLB fill, PTE specifies User rights) 6 -> 7 (ITLB fill, PTE specifies Supervisor rights) 6 -> 8 (ITLB fill, PTE specifies User rights) 0 -> 3 (DTLB fill, PTE specifies Supervisor rights) 0 -> 6 (DTLB fill, PTE specifies User rights) 1 -> 4 (DTLB fill, PTE specifies Supervisor rights) 1 -> 7 (DTLB fill, PTE specifies User rights) 2 -> 5 (DTLB fill, PTE specifies Supervisor rights) 2 -> 8 (DTLB fill, PTE specifies User rights) the other half of the possible state transitions are like the ones above but going into the opposite direction (they represent the flushing of the TLB entry due to expiry) and the following two: 4 -> 0 (explicit TLB flush caused by an instruction) 8 -> 0 (explicit TLB flush caused by an instruction) the careful reader has probably noticed that there are no direct state transitions between S and U states. the reason for this is that for this discussion we ignore the fact that TLB entries can be directly manipulated via machine specific registers (MSRs). if an OS does indeed manipulate TLB entries directly, then the extra state transitions must be taken into account as well. armed with this information, we can now define 'good' and 'bad' states and state transitions, then determine what it takes to enter and maintain 'good' states only. a state is called 'good' when it does not violate the 'non executable' feature of a page (and it is 'bad' when it does). the 'good' states are as follows: 0,3,6: no code can execute in such a page since it has not even been accessed yet via instruction fetches (there is no ITLB entry) 1,4,7: user mode code (a thread executing at CPL=3) would cause a page fault in such a page, hence the OS could act at such an attempt note that the initial state (0) is a 'good' one, i.e. we will 'only' have to make sure that we never make a transition into a 'bad' state. a state transition is called 'good' when it goes into a 'good' state (and it is a 'bad' one when it goes into a 'bad' state). our goal is to maintain a 'good' state, therefore we will want to get notified when a 'bad' state transition is attempted (and then prevent it of course). since there is no way to get notified on a TLB entry flush (due to expiry or explicit flush), we shall not be concerned with them from now on (except for the 5 -> 2 and 8 -> 2 transitions there are no 'bad' ones of this kind anyway, and these two originate in a 'bad' state which by design should not occur in the first place. see also the note about paranoid security later). the 'good' state transitions are as follows: 0 -> 1 3 -> 4 6 -> 7 0 -> 3 0 -> 6 1 -> 4 1 -> 7 and the 'bad' ones are: 0 -> 2 3 -> 5 6 -> 8 2 -> 5 2 -> 8 note that the last two 'bad' transitions are special in that they originate in a 'bad' state (which by our design should not be possible to happen). depending on how paranoid we are (how much we trust our implementation, general OS integrity, etc), we may or may not take them into account. in the following discussion we will *not* consider them as possible transitions (and therefore will *not* protect against them), however we will also point out how they would affect the design and implementation. it turns out that we are essentially making a performance vs. (paranoid) security decision. now we can turn our attention to determine how we can detect 'bad' state transition attempts in user mode code. we can detect such a transition if we can somehow make the processor interrupt normal execution flow when the transition is just about to happen. it should come as no surprise that the only way to achieve this is to have a page fault at such a transition. the possibilities for each 'bad' transition are as follows: 0 -> 2, 3 -> 5, 6 -> 8: - have a PTE with the Present (P) bit not set (PTE.NP). when any code causes an ITLB load from such a PTE, we will get a page fault. - have a PTE with the U/S bit set to S (PTE.S). when user mode code causes an ITLB load from such a PTE, we will get a page fault. 2 -> 5, 2 -> 8: - have a PTE with the Present (P) bit not set (PTE.NP). when any code causes an ITLB load from such a PTE, we will get a page fault. whatever our choice will be, we have to determine its effects on 'good' transitions as well: - PTE.NP will cause extra page faults for transitions that normally would not (necessarily) cause one - PTE.S will eliminate (transform) some transitions and cause extra page faults the following table describes how each of the above choices would affect 'good' transitions when triggered by supervisor and user mode code (where applicable): trans | choice: PTE.NP PTE.S ition | mode: S U S U ------------------------------------------------------------------- 0 -> 1: PF PF none PF (transformed from 0 -> 2) 3 -> 4: PF PF none PF (transformed from 3 -> 5) 6 -> 7: PF PF none PF (transformed from 6 -> 8) 0 -> 3: PF PF none PF 0 -> 6: PF PF n/a n/a (transformed into 0 -> 3) 1 -> 4: PF n/a none n/a 1 -> 7: PF n/a none n/a from this we see that there is indeed a performance vs. security decision to be made. if we stick to paranoid mode, the only choice we have left is the PTE.NP one. unfortunately this one causes more page faults as well, and therefore will have a bigger impact on general system performance, and from a more practical point of view it would require a more complex change in the existing paging subsystem of an OS as the Present bit of the PTEs plays a crucial role in implementing page (swap) file based virtual memory (and is used/checked for throughout the OS) whereas the User/Supervisor bit normally has no overloaded meaning. to summarize this section, let us state what a non-executable page implementation must do (as far as this specific feature is concerned, the rest of the paging subsytem may need to be modified if for one reason or another it interferes with the non-executable page logic, we shall see in the Linux implementation section that we indeed have an issue regarding copy-on-write): 3.1. create the PTE for the page with the U/S bit set to S. 3.2. during the lifetime of the PTE, keep the U/S bit in the PTE set to S. 3.3. extend the page fault handler to - handle faults during 'bad' state transition attempts (remember that they are transformed): 0 -> 1: 3 -> 4: 6 -> 7: terminate user thread since it has apparently attempted to execute code in a page marked as non-executable. exact action can be more fine grained of course and give a human user or another program the choice for action. not that if anyone enjoyed clicking on yes/no buttons on a busy web server... ;-) - handle faults during 'good' state transition attempts: 0 -> 3: this is normal data access to the page transformed from the 0 -> 6 transition (stack operation, heap access, etc), it must be allowed and handled as fast as possible. to do this: - flush the TLBs for this page (invlpg x86 instruction) - change PTE.S to PTE.U - access page, this will load the DTLB with User rights - change PTE.U back to PTE.S (as 3.2 requires it) - resume user thread to differentiate between the various reasons for the page fault the handler will have to examine the error code (to determine if PTE.S or something else caused the fault) then compare the fault address (register CR2) against the address of the instruction which caused it (saved EIP on the stack) and determine if they are the same ('bad' transition attempt) or not. the last issue that needs to be discussed is how a multiprocessor (MP) environment changes all this. luckily, there is not much to be concerned about. the only time MP is an issue is when the page fault handler attempts to load the DTLB with User rights since during this brief period of time the PTE content violates 3.2 and *if* PTEs are shared among processors we have to prevent that PTE from loading into other processors' ITLB. there are several ways with different impact on performance and security: - on the fault handling processor we can switch to a new private PD and modify/use a private copy of the PTE in question. this method will flush the TLBs (bad), although keep entries for global pages, but does not require any synchronization among processors (good). - stall execution on all the other processors, and modify/use the shared PTE. this method does not affect the TLBs except for the intended DTLB entry (good), but requires synchronization among processors (bad). - modify/use the shared PTE without stalling the other processors. this method does not affect the TLBs except for the intended DTLB entry (good), nor does it require synchronization among processors (good), however it leaves a brief window in time open where the User mode PTE might get loaded into an unintended processor's ITLB (bad). 4. Linux implementation the core structure of the virtual memory manager is vm_area_struct declared in include/linux/mm.h. this structure describes a contigous (in linear address space) block of pages of a task that share the same attributes (readable, writable, executable, shared/private, etc). of interest are two fields: vm_flags: these flags determine what can be done to the given range of pages (read/write/execute), whether changes to the pages are kept private to the current task (copy-on-write is active) or shared with other tasks mapping the same range (think of shared memory), and some other flags not important for our discussion (see footnote 4) vm_page_prot: these are the actual flags put into a PTE whenever physical memory is assigned to the given memory range the translation from vm_flags to vm_page_prot is described by a simple array called protection_map[] and defined in mm/mmap.c. as it turns out, this is where the architecture independent part of the paging subsystem ends as the actual symbols used in the mapping are defined in architecture dependent include files. for IA-32 they are in include/asm-i386/pgtable.h. in our implementation we opted for intentionally breaking existing code and renamed existing constants (in addition to defining new ones), so that programmers in the future will have to explicitly state whether they need executable memory or not. the next step is to extend the page fault handler whose first (and for our discussion relevant) part is architecture dependent. for IA-32 it is in arch/i386/mm/fault.c and is called do_page_fault(). page faults can arise for various reasons, and to expedite handling them IA-32 processors put a specially formatted error code on the stack in addition to the usual stack frame of exception handlers. the processor dependent part of the page fault handler makes its decisions based on the error code and vm_flags of the memory range containing the fault address (of course not all page faults occur in mapped memory, but we are not concerned with them). the error code specifies the following (see footnote 5 too): name possibilities ------------------------------------------------------------------------- reason PTE is not present (PTE.NP) privilege violation in user mode code (PTE.S) write attempt to a read-only or copy-on-write page (PTE.R) read or write the faulting instruction attempted a read or write (R/W) attempt (an instruction fetch counts as a read as well) supervisor the faulting instruction executed in user mode (CPL=3) or user mode or not (CPL=0) next we have to determine the actions taken by the fault handler for each reason and then define how our solution (PTE.S) changes them. the following table sums up all these actions taken by the old and the new handlers: legend: vm_flags: flags in vm_flags (4 LSB only) pte: flags in pte for given vm_flags (3 LSB only) err: possible error codes for given pte flags (3 LSB only) action: sigbus (np): signal task (access to a non-present page) sigbus (w): signal task (write attempt to a read-only page) cow: handle copy-on-write emu: emulate PTE.U and allow access emu/kill: check if access is legitimate (emu) or not (kill) vm_flags original PaX pte err action pte err action ------------------------------------------------------ 0000 000 xx0 sigbus (np) 000 xx0 sigbus (np) 0001 101 x11 sigbus (w) 001 011 sigbus (w) 111 sigbus (w) 101 emu/kill 0010 101 x11 cow 001 011 cow 111 cow 101 emu/kill 011 111 emu 0011 101 x11 cow 001 011 cow 111 cow 101 emu/kill 011 111 emu 0100 101 x11 sigbus (w) 101 011 sigbus (w) 0101 101 x11 sigbus (w) 101 011 sigbus (w) 0110 101 x11 cow 101 011 cow 0111 101 x11 cow 101 011 cow 1000 000 xx0 sigbus (np) 000 xx0 sigbus (np) 1001 101 x11 sigbus (w) 001 011 sigbus (w) 111 sigbus (w) 101 emu/kill 1010 111 cannot fault 011 101 emu/kill 111 emu 1011 111 cannot fault 011 101 emu/kill 111 emu 1100 101 x11 sigbus (w) 101 011 sigbus (w) 1101 101 x11 sigbus (w) 101 011 sigbus (w) 1110 111 cannot fault 111 cannot fault 1111 111 cannot fault 111 cannot fault the following table shows the same information (for PaX) but in a format more suitable for programmers (the initial decision making part of the fault handler is similar to a logic circuit that has to realize a truth table in a possibly optimal way): vm_flags| err | 000 001 010 011 100 101 110 111 --------------------------------------------- 0000 sig sig sig sig 0001 sig e/k sig 0010 cow e/k cow/emu 0011 cow e/k cow/emu 0100 sig 0101 sig 0110 cow 0111 cow 1000 sig sig sig sig 1001 sig e/k sig 1010 e/k emu 1011 e/k emu 1100 sig 1101 sig 1110 1111 now that we have established our new page fault handler, let's discuss some more (or less) Linux specific issues, namely the side effects of PaX on existing code. the first one is that current uses (might as well say abuses...) of the executable stack will be rendered dead. this affects signal handling (in cases where the return-from-sig-handling code is put on the user mode stack) and trampolines (gcc extensions and perhaps other languages can end up in code that is executed from the user mode stack). in the current implementation we decided to take care of the former (and probably more serious) problem only, we leave handling trampolines for other people (in a security conscious environment such code should not be allowed to execute in the first place). luckily, the code put on the user mode stack is of a fixed length/content, therefore all we have to do is simply check for the code pattern. one note of caution: since this check attempts to read memory that may lie on a page other than that of the faulting instruction, we used a little trick by changing the alignment used before the code (as part of a structure) is copied on the user mode stack (the reader is invited to do the math him/herself). there would also be another way out of this apparent abuse of executable stacks whereas the kernel could allocate one (executable) page for this purpose and map it into each task's address space (this would 'waste' some 4 kbytes of memory). the second issue (the less Linux specific one) is the performance impact of PaX. we have both bad and good news. as one can guess, the bigger and more efficient the TLB is, the fewer page faults will occur. on standard IA-32 processors the TLBs appear to have anywhere between 64 and 256 entries (see footnote 6), and they are at least 4-way associative, or even fully associative. this means, that after at most 256 accesses to different pages at least one TLB entry will have to expire (opening the door for a potential page fault in the future). a simple test program such as the one below will demonstrate the effect and measure the maximum impact of PaX. #include int main() { char* buf; int i,j; buf = (char*)malloc(4096*257); for (j=0; j<100000; j++) { for (i=0; i<257; i++) { buf[i*4096] = 'a'; } } return (0); } timing results on a non-patched 2.2.14 and a patched 2.2.17 kernel were as follows: 6.20user 0.01system 0:06.21elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (77major+266minor)pagefaults 0swaps 6.15user 29.74system 0:35.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (77major+266minor)pagefaults 0swaps as expected, in the PaX kernel the extra page fault processing will take up a considerable amount of time (so no free lunch). however there is good news as well. in a real world test (kernel compilation) we measured less than 5-8% impact and in general, during everyday use there is no serious, perceivable slow down. 5. Windows NT/2000 implementation unfortunately lack of time/mood/etc prevented us from doing a thorough research and implement it for now, but interested readers are directed to Ice (white_ice@usa.net) or fOSSiL to take part in the reverse engineering, design and implementation process. a few pointers for those who want to go ahead alone: - for reverse engineering the tool to use is IDA (the Interactive DisAssembler, http://www.datarescue.com). the next source of information is the symbol files available at the Customer Support Diagnostics page (http://www.microsoft.com/WINDOWS2000/downloads/ Other Downloads section has the link) - the following symbols are probably worth a look: - _MmUserProtectionToMask1, _MmUserProtectionToMask2 - @KiFlushSingleTb@8, @KeFlushSingleTb@20 - @MiDetermineUserGlobalPteMask@4, @MiMakeProtectionMask@4 - _MmProtectToPteMask, _MmProtectToValue - the implementation should be a kernel mode driver (KMD) loaded at boot time. it will most likely have to do some kernel patching at runtime which is not trivial (think about thread contexts, SMP). - beware of Physical Address Extension and Page Size Extension - do not trust your favourite debugger, it has very little clues about the Windows NT/2000 paging subsystem. 6. final words (to protect the innocent, names have been changed ;-) well, buffer overflows will be gone in 2-3 years hopefully sooner ;-) a month at most what ?? you bullshitting me now or what ? just sit back and watch ;-) sure i do no ;-) ok phew argh you got me there for a second ;> heh, why? coz I was in panic that there was something big coming up it is already getting harder to find overflows now on a more serious note, the PaX team would like to thank acpizer, dezzy and halvar for their support, help and testing the Linux version. 7. contact information The PaX Team PGP key fingerprint: D2E0 B4B6 16A3 B532 20B8 969B 956D 2366 39F0 81BF for participation in the Windows NT/2000 research/implementation contact Ice or fOSSiL 8. history 2000.08.06 initial document 2000.10.01 pax/linux implementation was born 2000.10.22 fixed minor inaccuracies 2000.10.27 added copy-on-write issue 2000.10.28 finished linux implementation description 2000.11.05 fixed VM_IO handling, SysV IPC shared memory became NOEXEC 2000.11.16 fixed race on page directory/table accesses 9. footnotes 1. http://www.plex86.org/ 2. depending on where this data originates from (local or a remote host), we talk about local and remote exploits, although the underlying problem remains the same. 3. http://www.plex86.org/news.phtml?id=3 4. we found it somewhat unfortunate how VM_STACK_FLAGS was defined. first, its definition was not based on the existing symbolic constants but had the raw hexadecimal value (so a simple 'grep' for 'interesting' symbols would miss it whereas it is of crucial importance for determining the protection flags for stack pages). second, its value included execution permission which is of course exactly what we want to avoid. our patch fixes at least the latter issue (holy lazyness ;-). 5. for PPro+ processors when CR4.PSE is enabled the error code may also indicate whether the fault occured due to some reserved bits in the paging structures not being 0. in Linux (2.2.x at least) this can never happen, hence its omission. 6. see links at http://www.sandpile.org/ in the 'impl' section