1. Generic Notes & Some Design

To understand the future direction of PaX, let's summarize what we achieve
currently. The goal is to prevent/detect the exploitation of software bugs that
allow arbitrary read/write access to the attacked process. Exploiting such bugs
gives the attacker three different levels of access into the life of the
attacked process:

(1) introduce/execute arbitrary code
(2) execute existing code out of original program order
(3) execute existing code in original program order with arbitrary data

Non-executable pages (NOEXEC) and mmap/mprotect restrictions (MPROTECT) prevent
(1) with one exception: if the attacker is able to create/write to a file on
the target system and then mmap() it into the attacked process, then he will
have effectively introduced and executed arbitrary code.

Address space layout randomization (ASLR) prevents all of (1), (2) and (3) in a
probabilistic sense for bugs where the attacker needs advance knowledge of
addresses in the attacked process *and* he cannot learn about them (i.e., there
is no information leaking bug in the target).

It is also worth noting that since all of PaX is implemented in the kernel, the
kernel is considered the Trusted Computing Base, which can be subject to the
same kind of attacks as userland.

Based on the above, the future efforts will aim at the following:

(a) try to handle the exception to (1)
(b) implement all possible features for protecting the kernel itself against
    the exploitation of kernel bugs, optionally (long term) create a non-kernel
    mode TCB
(c) implement deterministic protection for (2) and maybe (3)
(d) implement probabilistic protection for (2) potentially resisting
    information leaking

We do not plan to deal with (a) in general; it is better left for Access
Control and/or Trusted Path Execution systems which are beyond our project
(e.g., grsecurity or some LSM modules). Note also that detecting/preventing (3)
is probably equivalent to the halting problem (since the protection would have
to detect data changes and their influences on all possible execution flows),
so we do not expect generic solutions such as for (1). Even detecting/preventing
(2) looks suspicious in this regard (given our attack model), but we have yet
to see where and to what extent we can make a compromise to achieve the goal.

2. More Design & Implementation

In the following we will give some details on (a), (b), (c) and (d), although
it should be noted that all this is just plans; we do not know yet what will
prove to be practical in the end. There are three measures that will help
answer this question though:

- how many/extensive changes are required (complexity),
- how much protection is achieved (efficiency),
- is it fast enough for practical purposes (performance).

In the following we will try to answer these questions as well.

(a.1) For a certain subset of the exception there can be a simple solution,
requiring only some changes in the dynamic linker: if the target application
is dynamically linked but does not load libraries explicitly (via the dlopen()
interface), then the dynamic linker could be modified to tell the kernel when
it is done loading all the shared libraries, and afterwards the kernel could
prevent all executable file mappings (or at least remove the executable status
on them). This solution will require userland changes (vs. kernel only) but
they will be fairly simple. It will also be efficient (achieves the goal for
this set of applications) and fast (basically no measurable performance
impact).
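To make the exception that (a.1) tries to close more tangible, here is a
minimal userland sketch (the file path and the one-byte 'ret' payload are
placeholders, the code is for illustration only): it shows how, even with
NOEXEC/MPROTECT active, bytes written to a file come back as executable code
through an ordinary file mapping, which is exactly what the kernel would start
refusing once the dynamic linker reports that it has finished loading
libraries.

    /* Minimal illustration of the mmap() exception to (1): bytes written to
     * a file can be brought back as executable code.  The payload is just a
     * single x86 'ret' and /tmp/pax_demo is a placeholder path. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        static const unsigned char payload[] = { 0xc3 };   /* x86 'ret' */
        int fd = open("/tmp/pax_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0 || write(fd, payload, sizeof payload) != (ssize_t)sizeof payload)
            return 1;

        /* executable *file* mappings are the one case NOEXEC/MPROTECT cannot
         * refuse without breaking shared libraries, hence the exception */
        void *map = mmap(NULL, sizeof payload, PROT_READ | PROT_EXEC,
                         MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED)
            return 1;

        ((void (*)(void))map)();    /* the byte written above now runs as code */
        puts("file-backed code executed");
        return 0;
    }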
(a.2) Actually, the above kernel notification is not strictly needed; the
kernel itself could determine whether previous file mappings reference dlopen()
and whether the dynamic linker is done loading libraries. This solution will be
kernel-only, however it will be more complex as we will have to parse the ELF
headers to determine whether the dlopen() interface is referenced for dynamic
linking (in (a.1) the dynamic linker already has the necessary parsing code).
It will also be as efficient and fast as (a.1) since it will do the same kind
of processing, just inside the kernel.

(b.1) Non-executable kernel pages: using the segmentation logic of IA-32 we can
reduce the KERNEL_CS segment to cover only the kernel image, furthermore we can
then make this area read-only using the paging logic. This step will most
likely require the reorganization of the kernel image (by modifying the kernel
linker script and maybe even source code). How module support fits into this is
yet to be determined. This solution will be complex and may not be kernel-only
(it depends on how the module problem can be solved, in particular whether the
current modutils supports mapping different ET_REL ELF file sections into
non-contiguous memory areas). We expect no performance impact from this (much
like SEGMEXEC) and we expect it to be very efficient (i.e., it will prevent the
introduction/execution of arbitrary code in kernel land).

(b.2) Read-only kernel function pointers: one way of achieving (2) is to modify
function pointers, and this is what this method would prevent. This solution
will require kernel-only changes, however they will be quite extensive
(although simple) as normally Linux does not use the 'const' qualifier for
structures that contain such pointers (even if they are really not meant to be
modified). Maybe this could also be brought to the attention of the mainline
kernel developers since this is more of a program correctness issue than one of
security per se. We expect no performance impact from this, however we note
that this is not an efficient solution since there are still methods of
achieving (2), in particular there are function pointer like entities on the
kernel stack (saved return EIP values from function calls). These can, however,
be addressed as well, as we will show later.
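As a small illustration of the 'const' change proposed in (b.2), consider the
following generic C fragment (the structure and functions are invented for the
example, they are not actual kernel code): the writable instance lands in
.data, where an arbitrary-write bug can redirect its pointers, while the const
instance lands in .rodata, which the measures above would keep read-only.

    /* Invented example structure: stands in for the many kernel structures
     * that hold function pointers but are not declared const today. */
    struct demo_ops {
        int (*open)(void *obj);
        int (*close)(void *obj);
    };

    static int demo_open(void *obj)  { (void)obj; return 0; }
    static int demo_close(void *obj) { (void)obj; return 0; }

    /* writable: placed in .data, so an arbitrary-write bug can overwrite
     * .open and divert execution flow, i.e., achieve (2) */
    static struct demo_ops writable_ops = { demo_open, demo_close };

    /* const: placed in .rodata, which (b.1)/(b.2) would map read-only, so the
     * same bug can no longer turn this structure into a control-flow target */
    static const struct demo_ops readonly_ops = { demo_open, demo_close };

    int main(void)
    {
        return writable_ops.open(0) + readonly_ops.close(0);
    }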
(b.3) Non-kernel based TCB: IA-32 has two execution modes that are normally
beyond the scope of normal programs (be that OS kernels or userland): System
Management Mode (SMM) and Probe Mode. SMM is normally used to implement power
management schemes (APM/ACPI), while Probe Mode is used for low-level
(hardware) debugging. Since using Probe Mode would require extra hardware
normally not present in consumer systems, we will discuss SMM only, as most
current systems have the necessary hardware support in place already (the PCI
chipset) and the rest is a pure software matter.

SMM has a few features that would allow for implementing a new TCB, beyond the
reach and control of traditional OS kernels (i.e., it could not be compromised
even by ring-0 protected mode code). The first feature is that the PCI chipset
(the North Bridge or Memory Controller Hub in particular) allows carving out
some amount of RAM for SMM use only (i.e., such memory can be accessed only
while the CPU is in SMM). Normally this memory is set up by the BIOS, which
stores the power management code there (the System Management Interrupt (SMI)
handler and related data).

The second feature is that an SMI can be generated by both software and
hardware means, i.e., it is possible to create an API between the SMI handler
and the rest of the kernel. Also, the SMI handler can be invoked periodically,
beyond normal kernel control. These features would allow for storing various
cryptographic primitives inside the SMI handler and having them validate the
state of the kernel periodically. They would also provide the kernel with some
limited amount of secure data storage (albeit volatile). This solution is very
complex and very hardware specific, however it is very efficient as well.
Depending on the frequency of the SMI, it may or may not have a noticeable
performance impact.

(c.0) An attacker can achieve (2) if he is able to change the execution flow of
the target to his liking. Since such changes normally occur at very specific
instructions (jmp, call, retn), our goal is to protect these. All of these
methods will necessarily be userland changes; the kernel could not perform them
fast enough, if at all. Note also that these methods can be implemented at
various levels of abstraction, such as the source code, the compiler, the
static and the dynamic linker. Complexity is expected to be high in general,
regardless of what level they are implemented at. We also expect that we will
have to make tradeoffs between efficiency and performance, and even in the
extreme we do not expect to be very efficient (that is, there will remain
attack methods under specific circumstances and we may not even be able to
precisely determine them for a given target). With this said, let's look at a
few ideas now.

(c.1) To protect against execution flow changes via the jmp/call instructions,
we would have to be able to identify all explicit function pointers in the
target and make them read-only. While we can at least identify them if we have
the source code or there are sufficient relocations present in the given ELF
file (this is where using ET_DYN ELF files becomes important, even for normal
executables), making them read-only is a problem. First, such function pointers
are normally not qualified as 'const' in the original source code (in whatever
language), so in memory they will end up mixed with other writable data and
hence using the paging logic to make such pages read-only would introduce a big
performance impact. Second, there are function pointers which need to be
writable by the program logic itself. In any case, reducing the number of
writable function pointers does decrease the success rate of (2) since there
will be fewer places where the execution flow can be diverted (and the
remaining places may not be enough to successfully attack the given target).
The most important function pointers are the following:

- GOT entries
- .ctors/.dtors
- atexit() handlers
- malloc() callbacks
- linuxthread callbacks (atfork() handlers)

Of these at least the first two sets can be easily isolated and made read-only;
the others are by their nature writable. To protect the GOT entries we will
have to make changes to both the static and the dynamic linker. The static
linker will have to place the GOT into its own PT_LOAD segment so that later
the dynamic linker can use mprotect() to make it read-only. Making the GOT
read-only is possible either at application startup or on every GOT entry
update. The former requires the enforcement of the LD_BIND_NOW behaviour and
affects startup time (for the worse), whereas the latter affects GOT entry
resolution time (it needs two mprotect() calls to make the GOT temporarily
writable and then read-only again) and is open to races (although these are
probably not easy to exploit since by the time an attack would take place, most
needed GOT entries should have been resolved). Note that being able to change
the writability status of the GOT is not a problem: under PaX abusing it would
require function pointer hijacking (to call mprotect() with proper arguments),
which is either not possible, or if it is, then the attacker is already able to
call any function and pass arguments to it, so being able to change the GOT
does not give him anything extra. The actual implementation does not need to be
an ld.so modification; it can be another shared library loaded through
/etc/ld.so.preload. Protecting .ctors/.dtors is as simple as changing the
linker script to put these sections into a read-only segment.
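To illustrate the GOT handling described above, here is a minimal C sketch of
the two variants; got_start and got_size are assumed to describe the dedicated,
page-aligned GOT segment proposed above (something the current toolchain does
not provide by itself), and the helpers could live in a modified ld.so or in a
library preloaded via /etc/ld.so.preload.

    /* Sketch of the "make the GOT read-only" idea from (c.1).  The segment
     * description (got_start/got_size) is assumed to come from the dedicated
     * PT_LOAD segment proposed above; the names are illustrative only. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    static void *got_start;     /* assumed: start of the dedicated GOT segment */
    static size_t got_size;     /* assumed: its page-aligned size */

    /* startup variant: after all symbols are resolved (LD_BIND_NOW
     * behaviour), seal the GOT once and for all */
    static int seal_got(void)
    {
        return mprotect(got_start, got_size, PROT_READ);
    }

    /* lazy-resolution variant: every GOT entry update needs two mprotect()
     * calls, which is the performance/race tradeoff mentioned above */
    static int update_got_entry(uintptr_t *entry, uintptr_t target)
    {
        if (mprotect(got_start, got_size, PROT_READ | PROT_WRITE))
            return -1;
        *entry = target;                /* write the resolved address */
        return mprotect(got_start, got_size, PROT_READ);
    }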
(c.2) To protect against execution flow changes via the retn instruction, we
face the same problem as above with function pointers that are writable by
their nature. What makes the situation different is that we can determine in
advance (at compile/link time) and verify at runtime the valid values for a
given saved EIP on the stack. An efficient and simple example is shown below:

   callee epilogue:             mov register,[esp]
                                cmp [register+1],MAGIC
                                jnz .1
                                retn
                            .1: jmp esp

   caller:                      call callee
                                test eax,MAGIC

What happens here is that after the call we insert an instruction that does not
otherwise affect the program logic but encodes in its body a magic number
derived from the callee's prototype; before returning from the callee we then
verify that we are about to return to a proper call site ('register' is any
register that can be trashed inside a function). The jmp esp will simply
trigger the non-executable page enforcement logic and terminate the task.

This method is efficient in that it sharply reduces the number of code
locations where a retn can be forced to return to; in particular it cannot
return to the normal entry point of functions, which makes argument passing and
stack balancing non-trivial. Note also that it does not matter whether the call
itself is a direct or an indirect one, i.e., calls through traditional function
pointers are just as well protected as normal function calls (although it may
be worth separating the two cases because the magic number for directly called
functions can be based on more information, such as the function's name).

The drawback of this method is that the reaction to an attack is delegated to
userland, i.e., it is compiled into every protected binary. It means that when
deploying this method in a system, every component (executables + libraries)
must be recompiled at once; it cannot be gradually introduced into the system,
nor can binaries be moved across systems of a different nature, nor can it be
disabled easily (that is, it would require another full system recompilation).
As we will see later, there is a probabilistic approach that remedies some of
these problems.

(c.3) The next method works by assuming a weakened attack model where we still
assume arbitrary read/write access but we aim to prevent only those attacks
that need to divert execution flow more than once (i.e., the multiple
return-to-libc style attack). In this case we can modify the function
prologue/epilogue code so that a necessary 'ingredient' for the multiple style
of the attack would no longer be present at all.
Let's observe first that on IA-32 function parameters are normally passed on
the stack (vs. in registers) and hence the stack needs to be rebalanced after a
function call. In a multiple return-to-libc style attack the attacker has to
explicitly rebalance the stack after every diverted function call (under Linux
the default calling convention is 'caller rebalances' vs. the 'callee
rebalances' convention used in Windows). Rebalancing requires the modification
of ESP by a specific amount and then also transferring control to execute the
next step of the attack. Such an instruction sequence is normally found in the
epilogue code of functions and therefore this is what has to be modified (and
for symmetry reasons, the prologue as well). Our idea is to modify ESP inside a
function so that it would be way off track while the function executes and to
have prologue and epilogue code similar to this:

                without frame ptr               with frame ptr

   prologue:    sub esp,BIG_VALUE+4             push ebp
                mov [esp+BIG_VALUE],register    mov ebp,esp
                ...                             push register
                                                sub esp,BIG_VALUE
                                                ...

                mov [esp+BIG_VALUE+TOP],param   mov [esp+BIG_VALUE+TOP],param
                add esp,BIG_VALUE-TOP-4         add esp,BIG_VALUE-TOP-4
                call func                       call func
                sub esp,BIG_VALUE-TOP           sub esp,BIG_VALUE-TOP
                ...                             ...

   epilogue:    mov register,[esp+BIG_VALUE]    mov register,[esp+BIG_VALUE]
                add esp,BIG_VALUE+4             mov ebp,[esp+BIG_VALUE+4]
                retn                            add esp,BIG_VALUE+8
                                                retn

It is clear that an appropriately chosen BIG_VALUE will prevent the stack
rebalancing since during an attack only the epilogue is executed, without the
corresponding prologue. Note also that ESP has to be restored across function
calls so that the callee can perform the same ESP modification.

This method is expected to be efficient but also to have a noticeable impact on
performance, although we cannot tell how much (and whether it will be
acceptable for practical purposes). It will also be complex, if for no other
reason than signal delivery, which wants to use the current stack (ESP) and in
our case would result in the kernel killing the task. There are different ways
of addressing this problem, such as delaying signal delivery (e.g., during
system calls we will have a valid userland ESP, so at that time signals can be
safely delivered), setting up a special signal stack (an already supported
POSIX feature, although doing it in multithreaded applications would not be
trivial), or teaching the kernel about the userland ESP shift value and having
it take that into account before delivering the signal.

(c.4) Attack method (3) is probably the hardest problem of all. Since here the
attacker relies on normal program behaviour (as far as execution flow is
concerned), we cannot detect changes there. What is possible is to try to
prevent specific ways of modifying arbitrary memory and hence reduce the
potential arsenal an attacker can use. The following is a list of the most
well-known and widespread bugs that give some kind of write (and sometimes
read) ability:

- stack based array overflow
- heap based array overflow
- user controlled format string
- integer overflow
- incorrect integer signedness change

There are already a few known solutions (compiler modifications) for
detecting/preventing the exploitation of such bugs; where we can extend such
work is to attempt the same at the binary level (i.e., without access to the
source code). The best candidates for such modifications are ET_DYN ELF files
(typically libraries); ET_EXEC ELF executables have the fundamental problem of
not having enough relocations (maybe vendors should be educated to produce
either ET_DYN executables in the future or at least to generate sufficient
relocations into ET_EXEC files; GNU ld makes the latter very easy now).
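To make the last item on the above list more concrete, here is a textbook-style
fragment (invented purely for illustration, not taken from any real program)
where an incorrect integer signedness change turns a bounds check into an
out-of-bounds write:

    /* Illustration of an incorrect integer signedness change: a negative
     * 'len' passes the signed comparison, then becomes a huge unsigned value
     * in memcpy(), giving the attacker an out-of-bounds write. */
    #include <string.h>

    #define BUF_SIZE 64

    void copy_record(char *dst, const char *src, int len)
    {
        char buf[BUF_SIZE];

        if (len > BUF_SIZE)         /* signed check: len == -1 passes... */
            return;

        memcpy(buf, src, len);      /* ...but is 0xffffffff bytes here */
        memcpy(dst, buf, len);
    }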
(d.1) If we weaken our attack model by assuming arbitrary write ability but no
reading, then we can extend ASLR to introduce randomness into the saved values
of EIP and hence deter attacks such as the one described in Phrack #59-9. An
efficient and simple (fast) implementation could be the modification of
function prologue and epilogue code as shown below:

                without frame ptr               with frame ptr

   prologue:    xor [esp],esp                   xor ebp,[esp]
                sub esp,xxxx                    xor [esp],esp
                ...                             push ebp
                                                mov ebp,esp
                                                sub esp,xxxx
                                                ...

   epilogue:    add esp,xxxx                    add esp,xxxx/mov esp,ebp
                xor [esp],esp                   pop ebp
                retn/retn xx                    xor [esp],esp
                                                xor ebp,[esp]
                                                retn/retn xx

This trick uses the fact that under ASLR ESP has randomness in its low bits as
well (bits 4-7).

(d.2) The next method is the probabilistic version of saved EIP checking. Here
the kernel will create a random 32-bit value (cookie) on every system call and
place it into a variable provided by userland (it can be described in the ELF
header, for example). Then we modify userland as follows:

   callee epilogue:             mov register,[esp]
                                mov register,[register+1]
                                sub register,MAGIC
                                neg register
                                sbb register,register
                                lock or [cookie],register
                                retn

   caller:                      call callee
                                test eax,MAGIC

It is clear that during a normal execution flow the value of the cookie does
not change after a function returns, hence it can be checked whenever the task
enters the kernel through a system call. Should the kernel find a discrepancy
between the cookie as stored in userland and its own version, it can react like
it does for the non-executable page violations. To make information leaking
useless, the kernel can simply generate a new cookie on each system call (after
having verified the previous one, of course); this way an attacker cannot learn
the current value of the cookie since that itself requires communication and
hence a system call, which would obsolete the leaked information.

This method is expected to be as efficient as its deterministic version and it
also suffers from one less problem: it can be applied gradually to a system (no
need to recompile everything at once), and protected executables can be freely
moved between different systems without adverse effects. Note that this method
is free from races (coming from signal delivery or multiple threads) because
the cookie is updated with a single irreversible atomic 'or' instruction (the
kernel should of course avoid generating the 0xffffffff value for the cookie,
but since its probability is very low, this is not critical).

Finally, we would like to improve on the existing methods PaX provides. In
particular, we should change the way we get randomness for ASLR from the kernel
since the current method is too expensive and is overkill, as we do not really
need such high quality randomness at all. The other possible improvement is in
the fast path of the PAGEEXEC page fault handler, which could be sped up a bit
by using assembly.
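As an illustration of the kind of cheaper randomness meant here, the sketch
below reseeds a trivial generator from the real entropy pool only occasionally
instead of querying it on every request; the function names, the reseed
interval and the generator constants are illustrative choices only and not the
actual PaX implementation.

    /* Sketch of a cheaper randomness source for ASLR: a simple 32-bit LCG
     * that is only occasionally reseeded from the kernel's entropy pool via
     * get_random_bytes().  Purely illustrative, locking omitted. */
    #include <linux/types.h>
    #include <linux/random.h>

    static u32 pax_rand_state;
    static u32 pax_rand_calls;

    static u32 pax_get_random_u32(void)
    {
        /* pull fresh entropy only every 256 requests */
        if ((pax_rand_calls++ & 0xff) == 0)
            get_random_bytes(&pax_rand_state, sizeof(pax_rand_state));

        /* cheap LCG step: good enough for picking address space layout
         * deltas, not for cryptography */
        pax_rand_state = pax_rand_state * 1664525 + 1013904223;

        return pax_rand_state;
    }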