0. abstract

   this document discusses the possibility of implementing non-executable
   pages for IA-32 processors (i.e. pages which user mode code can read or
   write, but cannot execute code in). since the processor's native page
   table/directory entry format has no provision for such a feature, it is
   a non-trivial task.

   credit for the underlying idea goes to Kevin Lawton, Ramon van Handler
   and/or whoever else contributed it to the freemware/plex86 project (see
   footnote 1).


1. intro

   security of networked computers is an everyday topic of discussions these
   days. many of the security problems can be attributed to a special attack
   technique based on exploiting buffer overflows. this kind of attack relies
   on improperly processed data inside an application (see footnote 2).

   most of the time improper processing of data results in the execution of
   the attacker's code inside the application's context (i.e. as if the
   application itself had executed it). this in effect allows the attacker to
   perform any operation that the application itself would be able/allowed to.

   this is a very undesirable behaviour and should be remedied. the natural
   way of solving the problem is of course fixing the existing errors and
   making sure that new code is carefully designed/implemented (the perfect
   programmer case).

   since humans will always make mistakes, there is a desire to provide some
   sort of programmatic way for protecting against buffer overflow based
   attacks. one such idea is the implementation of non-executable pages which
   eliminates the possibility of executing code in pages which are supposed
   to hold data only (in a typical scenario a buffer overflow occurs in an
   array stored on the stack).

   on some microprocessors the paging logic allows for marking pages for
   execution only (vs. data read and/or write), however this feature is
   unfortunately missing from IA-32 and microprocessors based on it (Intel,
   Amd, etc).


2. the idea

   in 1999.07 the plex86 (then freemware) project conducted a test with the
   help of internet users. the purpose of this test was to determine support
   for a specific feature (some might consider it a bug, although it turns
   out to be a life-saver) in IA-32 compatible processors.

   this feature relies on the fact that pentium and newer processors have
   separate data and instruction translation lookaside buffers (DTLBs and
   ITLBs). TLBs act as a cache for page table entries (PTEs), hence store
   user/supervisor access rights among others. under normal circumstances
   the ITLB and the DTLB entries (for a given linear/physical address pair)
   get loaded from the same PTE hence they contain a consistent state.

   the idea that the plex86 project wanted to test is that in theory it is
   possible to write code which will cause an inconsistent state in the DTLB
   and ITLB entries. they were interested in having a page where only code
   execution is allowed, data read/write access would cause a page fault (PF)
   (and therefore allow detection of self-modifying code and the like, which
   is an important feature for their virtualization strategy).

   however, this very same mechanism would allow for creating another kind of
   inconsistent state where only data read/write accesses would be allowed
   and code execution prohibited. and this is what is needed for protecting
   against (many) buffer overflow based attacks.

   their tests proved successful (see footnote 3).


3. the theory

   PTE and TLB management is the task of the paging subsystem of the operating
   system. therefore implementing non-executable pages must be an integral part
   of the OS. in this section we will describe what such an implementation must
   take care of, and then attempt to implement it for Windows NT/2000 and Linux.
   since source code and/or deep system level documentation for the former is
   not publicly available, we will delve into some reverse engineering as well
   (hopefully Microsoft will incorporate this feature into an upcoming release
   of NT themselves).

   a non-executable page requires creating and maintaining a specific set of
   states in the TLBs. the state information represents the TLBs' knowledge
   of the given page and is summarized in the following table:

                              ITLB

                               - | S | U
                              -----------
                     D     - | 0 | 1 | 2 |
                     T        -----------
                     L     S | 3 | 4 | 5 |
                     B        -----------
                           U | 6 | 7 | 8 |
                              -----------

   that is, either TLB has either no entry for a given linear/physical address
   pair (-), or has an entry which specifies either User (U) or Supervisor (S)
   access rights.

   note that under normal circumstances state 5 and 7 never occur (they are
   the inconsistent states).

   the initial state for any page is state 0 (i.e. neither TLB has an entry,
   since there was no access to the given page). as the processor accesses
   the page (instruction fetch or data read/write access), the state begins
   to change in a well defined manner. one half of the possible state
   transitions is summarized below:

        0 -> 1 (ITLB fill, PTE specifies Supervisor rights)
        0 -> 2 (ITLB fill, PTE specifies User rights)
        3 -> 4 (ITLB fill, PTE specifies Supervisor rights)
        3 -> 5 (ITLB fill, PTE specifies User rights)      
        6 -> 7 (ITLB fill, PTE specifies Supervisor rights)
        6 -> 8 (ITLB fill, PTE specifies User rights)      

        0 -> 3 (DTLB fill, PTE specifies Supervisor rights)
        0 -> 6 (DTLB fill, PTE specifies User rights)      
        1 -> 4 (DTLB fill, PTE specifies Supervisor rights)
        1 -> 7 (DTLB fill, PTE specifies User rights)      
        2 -> 5 (DTLB fill, PTE specifies Supervisor rights)
        2 -> 8 (DTLB fill, PTE specifies User rights)      

   the other half of the possible state transitions are like the ones above
   but going into the opposite direction (they represent the flushing of the
   TLB entry due to expiry) and the following two:

        4 -> 0 (explicit TLB flush caused by an instruction)
        8 -> 0 (explicit TLB flush caused by an instruction)

   the careful reader has probably noticed that there are no direct state
   transitions between S and U states. the reason for this is that for this
   discussion we ignore the fact that TLB entries can be directly manipulated
   via machine specific registers (MSRs). if an OS does indeed manipulate TLB
   entries directly, then the extra state transitions must be taken into
   account as well.

   armed with this information, we can now define 'good' and 'bad' states and
   state transitions, then determine what it takes to enter and maintain
   'good' states only.

   a state is called 'good' when it does not violate the 'non executable'
   feature of a page (and it is 'bad' when it does). the 'good' states are
   as follows:

        0,3,6: no code can execute in such a page since it has not even been
               accessed yet via instruction fetches (there is no ITLB entry)

        1,4,7: user mode code (a thread executing at CPL=3) would cause a
               page fault in such a page, hence the OS could act at such an
               attempt

   note that the initial state (0) is a 'good' one, i.e. we will 'only' have
   to make sure that we never make a transition into a 'bad' state.

   a state transition is called 'good' when it goes into a 'good' state (and
   it is a 'bad' one when it goes into a 'bad' state). our goal is to maintain
   a 'good' state, therefore we will want to get notified when a 'bad' state
   transition is attempted (and then prevent it of course). since there is
   no way to get notified on a TLB entry flush (due to expiry or explicit
   flush), we shall not be concerned with them from now on (except for the
   5 -> 2 and 8 -> 2 transitions there are no 'bad' ones of this kind anyway,
   and these two originate in a 'bad' state which by design should not occur
   in the first place. see also the note about paranoid security later).

   the 'good' state transitions are as follows:

        0 -> 1
        3 -> 4
        6 -> 7
        0 -> 3
        0 -> 6
        1 -> 4
        1 -> 7

   and the 'bad' ones are:

        0 -> 2
        3 -> 5
        6 -> 8
        2 -> 5
        2 -> 8

   note that the last two 'bad' transitions are special in that they originate
   in a 'bad' state (which by our design should not be possible to happen).
   depending on how paranoid we are (how much we trust our implementation,
   general OS integrity, etc), we may or may not take them into account. in
   the following discussion we will *not* consider them as possible transitions
   (and therefore will *not* protect against them), however we will also point
   out how they would affect the design and implementation. it turns out that
   we are essentially making a performance vs. (paranoid) security decision.

   now we can turn our attention to determine how we can detect 'bad' state
   transition attempts in user mode code. we can detect such a transition if
   we can somehow make the processor interrupt normal execution flow when the
   transition is just about to happen. it should come as no surprise that the
   only way to achieve this is to have a page fault at such a transition.
   the possibilities for each 'bad' transition are as follows:

     0 -> 2, 3 -> 5, 6 -> 8:

        - have a PTE with the Present (P) bit not set (PTE.NP). when any code
          causes an ITLB load from such a PTE, we will get a page fault.

        - have a PTE with the U/S bit set to S (PTE.S). when user mode code 
          causes an ITLB load from such a PTE, we will get a page fault.

     2 -> 5, 2 -> 8:

        - have a PTE with the Present (P) bit not set (PTE.NP). when any code
          causes an ITLB load from such a PTE, we will get a page fault.

   whatever our choice will be, we have to determine its effects on 'good'
   transitions as well:

     - PTE.NP will cause extra page faults for transitions that normally would
       not (necessarily) cause one

     - PTE.S will eliminate (transform) some transitions and cause extra page
       faults

   the following table describes how each of the above choices would affect
   'good' transitions when triggered by supervisor and user mode code (where
   applicable):

     trans | choice: PTE.NP          PTE.S
     ition | mode:   S     U         S     U
     -------------------------------------------------------------------
     0 -> 1:         PF    PF        none  PF (transformed from 0 -> 2)
     3 -> 4:         PF    PF        none  PF (transformed from 3 -> 5)
     6 -> 7:         PF    PF        none  PF (transformed from 6 -> 8)
     0 -> 3:         PF    PF        none  PF
     0 -> 6:         PF    PF        n/a   n/a (transformed into 0 -> 3)
     1 -> 4:         PF    n/a       none  n/a
     1 -> 7:         PF    n/a       none  n/a

   from this we see that there is indeed a performance vs. security decision
   to be made. if we stick to paranoid mode, the only choice we have left is
   the PTE.NP one. unfortunately this one causes more page faults as well,
   and therefore will have a bigger impact on general system performance, and
   from a more practical point of view it would require a more complex change
   in the existing paging subsystem of an OS as the Present bit of the PTEs
   plays a crucial role in implementing page (swap) file based virtual memory
   (and is used/checked for throughout the OS) whereas the User/Supervisor bit
   normally has no overloaded meaning.

   to summarize this section, let us state what a non-executable page
   implementation must do (as far as this specific feature is concerned, the
   rest of the paging subsytem may need to be modified if for one reason or
   another it interferes with the non-executable page logic, we shall see in
   the Linux implementation section that we indeed have an issue regarding
   copy-on-write):

   3.1. create the PTE for the page with the U/S bit set to S.

   3.2. during the lifetime of the PTE, keep the U/S bit in the PTE set to S.

   3.3. extend the page fault handler to

      - handle faults during 'bad' state transition attempts (remember that
        they are transformed):

        0 -> 1:
        3 -> 4:
        6 -> 7:
               terminate user thread since it has apparently attempted to
               execute code in a page marked as non-executable. exact action
               can be more fine grained of course and give a human user or
               another program the choice for action. not that if anyone
               enjoyed clicking on yes/no buttons on a busy web server... ;-)

      - handle faults during 'good' state transition attempts:

        0 -> 3:
                this is normal data access to the page transformed from the
                0 -> 6 transition (stack operation, heap access, etc), it must
                be allowed and handled as fast as possible. to do this:

                - flush the TLBs for this page (invlpg x86 instruction)
                - change PTE.S to PTE.U
                - access page, this will load the DTLB with User rights
                - change PTE.U back to PTE.S (as 3.2 requires it)
                - resume user thread

      to differentiate between the various reasons for the page fault the
      handler will have to examine the error code (to determine if PTE.S
      or something else caused the fault) then compare the fault address
      (register CR2) against the address of the instruction which caused it
      (saved EIP on the stack) and determine if they are the same ('bad'
      transition attempt) or not.

      the last issue that needs to be discussed is how a multiprocessor (MP)
      environment changes all this. luckily, there is not much to be concerned
      about. the only time MP is an issue is when the page fault handler
      attempts to load the DTLB with User rights since during this brief
      period of time the PTE content violates 3.2 and *if* PTEs are shared
      among processors we have to prevent that PTE from loading into other
      processors' ITLB. there are several ways with different impact on
      performance and security:

        - on the fault handling processor we can switch to a new private PD
          and modify/use a private copy of the PTE in question. this method
          will flush the TLBs (bad), although keep entries for global pages,
          but does not require any synchronization among processors (good).

        - stall execution on all the other processors, and modify/use the
          shared PTE. this method does not affect the TLBs except for the
          intended DTLB entry (good), but requires synchronization among
          processors (bad).

        - modify/use the shared PTE without stalling the other processors.
          this method does not affect the TLBs except for the intended DTLB
          entry (good), nor does it require synchronization among processors
          (good), however it leaves a brief window in time open where the User
          mode PTE might get loaded into an unintended processor's ITLB (bad).


4. Linux implementation

   the core structure of the virtual memory manager is vm_area_struct declared
   in include/linux/mm.h. this structure describes a contigous (in linear
   address space) block of pages of a task that share the same attributes
   (readable, writable, executable, shared/private, etc). of interest are two
   fields:

       vm_flags:
           these flags determine what can be done to the given range of pages
           (read/write/execute), whether changes to the pages are kept private
           to the current task (copy-on-write is active) or shared with other
           tasks mapping the same range (think of shared memory), and some
           other flags not important for our discussion (see footnote 4)

       vm_page_prot:
           these are the actual flags put into a PTE whenever physical memory
           is assigned to the given memory range

   the translation from vm_flags to vm_page_prot is described by a simple
   array called protection_map[] and defined in mm/mmap.c. as it turns out,
   this is where the architecture independent part of the paging subsystem
   ends as the actual symbols used in the mapping are defined in architecture
   dependent include files. for IA-32 they are in include/asm-i386/pgtable.h.

   in our implementation we opted for intentionally breaking existing code
   and renamed existing constants (in addition to defining new ones), so that
   programmers in the future will have to explicitly state whether they need
   executable memory or not.

   the next step is to extend the page fault handler whose first (and for our
   discussion relevant) part is architecture dependent. for IA-32 it is in
   arch/i386/mm/fault.c and is called do_page_fault().

   page faults can arise for various reasons, and to expedite handling them
   IA-32 processors put a specially formatted error code on the stack in
   addition to the usual stack frame of exception handlers. the processor
   dependent part of the page fault handler makes its decisions based on the
   error code and vm_flags of the memory range containing the fault address
   (of course not all page faults occur in mapped memory, but we are not
   concerned with them).

   the error code specifies the following (see footnote 5 too):

     name           possibilities
     -------------------------------------------------------------------------
     reason         PTE is not present (PTE.NP)

                    privilege violation in user mode code (PTE.S)

                    write attempt to a read-only or copy-on-write page (PTE.R)


     read or write  the faulting instruction attempted a read or write (R/W)
     attempt        (an instruction fetch counts as a read as well)


     supervisor     the faulting instruction executed in user mode (CPL=3)
     or user mode   or not (CPL=0)
                      
   next we have to determine the actions taken by the fault handler for each
   reason and then define how our solution (PTE.S) changes them. the following
   table sums up all these actions taken by the old and the new handlers:

     legend:

       vm_flags: flags in vm_flags (4 LSB only)

       pte:      flags in pte for given vm_flags (3 LSB only)

       err:      possible error codes for given pte flags (3 LSB only)

       action:   sigbus (np): signal task (access to a non-present page)
                 sigbus (w):  signal task (write attempt to a read-only page)
                 cow:         handle copy-on-write
                 emu:         emulate PTE.U and allow access
                 emu/kill:    check if access is legitimate (emu) or not (kill)


     vm_flags   original                PaX
                pte err action          pte err action
     ------------------------------------------------------
     0000       000 xx0 sigbus (np)     000 xx0 sigbus (np)
     0001       101 x11 sigbus (w)      001 011 sigbus (w)
                                            111 sigbus (w)
                                            101 emu/kill
     0010       101 x11 cow             001 011 cow
                                            111 cow
                                            101 emu/kill
                                        011 111 emu
     0011       101 x11 cow             001 011 cow
                                            111 cow
                                            101 emu/kill
                                        011 111 emu

     0100       101 x11 sigbus (w)      101 011 sigbus (w)
     0101       101 x11 sigbus (w)      101 011 sigbus (w)
     0110       101 x11 cow             101 011 cow
     0111       101 x11 cow             101 011 cow

     1000       000 xx0 sigbus (np)     000 xx0 sigbus (np)
     1001       101 x11 sigbus (w)      001 011 sigbus (w)
                                            111 sigbus (w)
                                            101 emu/kill
     1010       111 cannot fault        011 101 emu/kill
                                            111 emu
     1011       111 cannot fault        011 101 emu/kill
                                            111 emu

     1100       101 x11 sigbus (w)      101 011 sigbus (w)
     1101       101 x11 sigbus (w)      101 011 sigbus (w)
     1110       111 cannot fault        111 cannot fault
     1111       111 cannot fault        111 cannot fault


   the following table shows the same information (for PaX) but in a format
   more suitable for programmers (the initial decision making part of the
   fault handler is similar to a logic circuit that has to realize a truth
   table in a possibly optimal way):

     vm_flags| err
             | 000 001 010 011 100 101 110 111
     ---------------------------------------------
     0000      sig     sig     sig     sig
     0001                  sig     e/k     sig
     0010                  cow     e/k     cow/emu
     0011                  cow     e/k     cow/emu

     0100                  sig
     0101                  sig
     0110                  cow
     0111                  cow

     1000      sig     sig     sig     sig
     1001                  sig     e/k     sig
     1010                          e/k     emu
     1011                          e/k     emu

     1100                  sig
     1101                  sig
     1110
     1111

   now that we have established our new page fault handler, let's discuss
   some more (or less) Linux specific issues, namely the side effects of PaX
   on existing code. the first one is that current uses (might as well say
   abuses...) of the executable stack will be rendered dead. this affects
   signal handling (in cases where the return-from-sig-handling code is put
   on the user mode stack) and trampolines (gcc extensions and perhaps other
   languages can end up in code that is executed from the user mode stack).

   in the current implementation we decided to take care of the former (and
   probably more serious) problem only, we leave handling trampolines for
   other people (in a security conscious environment such code should not be
   allowed to execute in the first place).

   luckily, the code put on the user mode stack is of a fixed length/content,
   therefore all we have to do is simply check for the code pattern. one note
   of caution: since this check attempts to read memory that may lie on a
   page other than that of the faulting instruction, we used a little trick
   by changing the alignment used before the code (as part of a structure) is
   copied on the user mode stack (the reader is invited to do the math 
   him/herself).

   there would also be another way out of this apparent abuse of executable
   stacks whereas the kernel could allocate one (executable) page for this
   purpose and map it into each task's address space (this would 'waste' some
   4 kbytes of memory).

   the second issue (the less Linux specific one) is the performance impact
   of PaX. we have both bad and good news. as one can guess, the bigger and
   more efficient the TLB is, the fewer page faults will occur. on standard
   IA-32 processors the TLBs appear to have anywhere between 64 and 256
   entries (see footnote 6), and they are at least 4-way associative, or
   even fully associative. this means, that after at most 256 accesses to
   different pages at least one TLB entry will have to expire (opening the
   door for a potential page fault in the future). a simple test program
   such as the one below will demonstrate the effect and measure the maximum
   impact of PaX.

      #include <stdio.h>

      int main()
      {
        char* buf;
        int i,j;

        buf = (char*)malloc(4096*257);
        for (j=0; j<100000; j++) {
          for (i=0; i<257; i++) {
            buf[i*4096] = 'a';
          }
        }
        return (0);
      }

   timing results on a non-patched 2.2.14 and a patched 2.2.17 kernel were
   as follows:

   6.20user 0.01system 0:06.21elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
   0inputs+0outputs (77major+266minor)pagefaults 0swaps

   6.15user 29.74system 0:35.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
   0inputs+0outputs (77major+266minor)pagefaults 0swaps

   as expected, in the PaX kernel the extra page fault processing will take
   up a considerable amount of time (so no free lunch). however there is good
   news as well. in a real world test (kernel compilation) we measured less
   than 5-8% impact and in general, during everyday use there is no serious,
   perceivable slow down.


5. Windows NT/2000 implementation

   unfortunately lack of time/mood/etc prevented us from doing a thorough
   research and implement it for now, but interested readers are directed
   to Ice (white_ice@usa.net) or fOSSiL <fossah@usa.net> to take part in
   the reverse engineering, design and implementation process.

   a few pointers for those who want to go ahead alone:

       - for reverse engineering the tool to use is IDA (the Interactive
         DisAssembler, http://www.datarescue.com). the next source of
         information is the symbol files available at the Customer Support
         Diagnostics page (http://www.microsoft.com/WINDOWS2000/downloads/
         Other Downloads section has the link)

       - the following symbols are probably worth a look:
           - _MmUserProtectionToMask1, _MmUserProtectionToMask2
           - @KiFlushSingleTb@8, @KeFlushSingleTb@20
           - @MiDetermineUserGlobalPteMask@4, @MiMakeProtectionMask@4
           - _MmProtectToPteMask, _MmProtectToValue

       - the implementation should be a kernel mode driver (KMD) loaded at
         boot time. it will most likely have to do some kernel patching at
         runtime which is not trivial (think about thread contexts, SMP).

       - beware of Physical Address Extension and Page Size Extension

       - do not trust your favourite debugger, it has very little clues about
         the Windows NT/2000 paging subsystem.


6. final words

   (to protect the innocent, names have been changed ;-)

       <innocent> well, buffer overflows will be gone in 2-3 years 
       <innocent> hopefully
       <pax> sooner ;-)
       <pax> a month at most
       <innocent> what ??
       <innocent> you bullshitting me now or what ?
       <pax> just sit back and watch ;-)
       <pax> sure i do
       <pax> no ;-)
       <innocent> ok
       <innocent> phew
       <innocent> argh
       <innocent> you got me there for a second ;>
       <pax> heh, why?
       <innocent> coz I was in panic that there was something big coming up
       <innocent> it is already getting harder to find overflows now


   on a more serious note, the PaX team would like to thank acpizer, dezzy
   and halvar for their support, help and testing the Linux version.


7. contact information

   The PaX Team <pageexec@freemail.hu>
   PGP key fingerprint: D2E0 B4B6 16A3 B532 20B8  969B 956D 2366 39F0 81BF

   for participation in the Windows NT/2000 research/implementation contact
   Ice <white_ice@usa.net> or fOSSiL <fossah@usa.net>


8. history

   2000.08.06 initial document
   2000.10.01 pax/linux implementation was born
   2000.10.22 fixed minor inaccuracies
   2000.10.27 added copy-on-write issue
   2000.10.28 finished linux implementation description
   2000.11.05 fixed VM_IO handling, SysV IPC shared memory became NOEXEC
   2000.11.16 fixed race on page directory/table accesses


9. footnotes

   1. http://www.plex86.org/

   2. depending on where this data originates from (local or a remote host),
      we talk about local and remote exploits, although the underlying problem
      remains the same.

   3. http://www.plex86.org/news.phtml?id=3

   4. we found it somewhat unfortunate how VM_STACK_FLAGS was defined. first,
      its definition was not based on the existing symbolic constants but had
      the raw hexadecimal value (so a simple 'grep' for 'interesting' symbols
      would miss it whereas it is of crucial importance for determining the
      protection flags for stack pages). second, its value included execution
      permission which is of course exactly what we want to avoid. our patch
      fixes at least the latter issue (holy lazyness ;-).

   5. for PPro+ processors when CR4.PSE is enabled the error code may also
      indicate whether the fault occured due to some reserved bits in the
      paging structures not being 0. in Linux (2.2.x at least) this can never
      happen, hence its omission.

   6. see links at http://www.sandpile.org/ in the 'impl' section