Difficulty: Intermediate

Module 5: Prologue & Epilogue Stubs

The injected code at function boundaries that enables transparent decrypt-on-call and re-encrypt-on-return.

Module Objective

Understand the exact structure of the prologue and epilogue stubs injected by X86RetModPass: the 0x46-byte prologue with its CALL/POP PIC trick, register preservation, handler invocation, and the epilogue stub that replaces each RET instruction. Learn how these stubs achieve position independence and transparent operation.

1. Prologue Stub Overview

The prologue stub is a 0x46-byte (70-byte) sequence prepended to every registered function. It runs before the function’s original code and performs one critical task: call the handler to decrypt the function body.

Prologue Stub Layout

Function Start (as seen by callers)
+-----------------------------------------------+
| Bytes 0x00-0x05: CALL/POP PIC trick            |  ← Get current RIP
| Bytes 0x06-0x20: Save registers (push)         |  ← Preserve all volatile regs
| Bytes 0x21-0x30: Save flags (pushfq)           |
| Bytes 0x31-0x3A: Set up handler args           |  ← Pass "decrypt" flag + function ptr
| Bytes 0x3B-0x3F: CALL handler                  |  ← Invoke the shared handler
| Bytes 0x40-0x44: Restore flags + registers     |  ← Restore original state
| Byte  0x45:      Fall through                  |  ← Continue to decrypted function body
+-----------------------------------------------+
| Original Function Body (now decrypted)         |
+-----------------------------------------------+

The key challenge is that the stub must work regardless of where the function is loaded in memory (ASLR). This is solved with the CALL/POP trick for position-independent code (PIC).

2. The CALL/POP PIC Trick

Position-independent code needs to know its own address at runtime. On x86 and x86-64, a classic way to obtain it is:

; CALL/POP trick for position-independent addressing
prologue_start:
    call    next_instruction    ; CALL pushes the return address (address of this CALL + 5)
next_instruction:
    pop     rbx                 ; POP retrieves the address of 'next_instruction'
    ; Now RBX = address of 'next_instruction' at runtime
    ; We can calculate any relative offset from here

The CALL instruction pushes the address of the next instruction onto the stack. The immediately following POP retrieves it into a register. Now the stub knows its own address and can calculate relative offsets to:

The function body start (prologue start + 0x46)
The handler function (resolved at runtime rather than through a fixed compile-time distance, as explained below)
The .funcmeta section (via PE header walking)
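
The address arithmetic can be checked with a small Python model. This is only a sketch: `STUB_SIZE` comes from the text, `CALL_LEN` assumes the 5-byte `E8 rel32` encoding of a near CALL, and `popped_address`, `body_start`, and `call_offset` (the position of the CALL inside the stub) are hypothetical names introduced for illustration.

```python
# Model of the CALL/POP arithmetic; no machine code is executed.
STUB_SIZE = 0x46   # prologue size stated in this module
CALL_LEN = 5       # assumption: near CALL encoded as E8 rel32

def popped_address(stub_base: int, call_offset: int = 0) -> int:
    """Value the POP receives: the address of the instruction after the CALL."""
    return stub_base + call_offset + CALL_LEN

def body_start(stub_base: int, call_offset: int = 0) -> int:
    """Function body address, derived only from the popped value."""
    # Constant fixed when the stub is emitted; independent of the load address.
    offset_from_pop = STUB_SIZE - call_offset - CALL_LEN
    return popped_address(stub_base, call_offset) + offset_from_pop

for base in (0x1_4000_1000, 0x7FF6_1234_0000):
    print(hex(popped_address(base)), hex(body_start(base)))
```

Whatever base the loader picks, the body lands at base + 0x46, and the constant added to the popped value (0x41 when the CALL is the first instruction) never changes. That is why the stub bytes need no relocation.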

Why Not LEA with RIP-Relative?

x86-64 supports LEA RAX, [RIP + offset] which is also position-independent. However, at the time the stub is emitted (PreEmit phase), the exact distance to the handler may not be finalized (the linker hasn’t placed sections yet). The CALL/POP trick works without knowing the absolute distance at compile time — the handler address can be resolved through PE header traversal at runtime.

3. Register Preservation in the Prologue

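The prologue must save and restore every register it touches. Under the Microsoft x64 ABI, the caller may have placed integer arguments in RCX, RDX, R8, and R9, floating-point arguments in XMM0-XMM3, and any further arguments on the stack. The prologue must not disturb any of these. The simplified listing covers the integer registers and flags only: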

; Complete prologue stub (simplified)
prologue:
    ; === Save RBX (non-volatile) before the POP below overwrites it ===
    push    rbx

    ; === PIC: Get our own address ===
    call    .Lnext
.Lnext:
    pop     rbx                 ; RBX = runtime address of .Lnext

    ; === Save volatile integer registers and flags ===
    push    rax
    push    rcx                 ; Arg 1 (MS x64 ABI)
    push    rdx                 ; Arg 2
    push    r8                  ; Arg 3
    push    r9                  ; Arg 4
    push    r10
    push    r11
    pushfq                      ; Save CPU flags (CF, ZF, SF, OF, etc.)

    ; === Set up handler arguments ===
    ; RCX = pointer to this function's body (RBX + offset_to_body)
    lea     rcx, [rbx + BODY_OFFSET]
    ; RDX = operation flag: 0 = decrypt, 1 = encrypt
    xor     rdx, rdx            ; 0 = decrypt

    ; === Call the handler ===
    ; Handler address is resolved via TEB or embedded offset
    call    handler_address

    ; === Restore ALL registers ===
    popfq
    pop     r11
    pop     r10
    pop     r9
    pop     r8
    pop     rdx
    pop     rcx
    pop     rax
    pop     rbx                 ; Restore RBX used by CALL/POP

    ; === Fall through to function body (now decrypted) ===
function_body_start:
    ; Original function instructions begin here

RBX Is Non-Volatile

Under the Microsoft x64 calling convention, RBX is a non-volatile (callee-saved) register, so the prologue must hand it back unchanged. Because the CALL/POP trick overwrites RBX, the stub saves it before the POP and restores it as its last instruction. If the compiler-generated function body also uses RBX, it saves and restores it independently inside its own frame, so there’s no conflict: the stub’s push/pop pair completes before the compiler’s frame setup begins.

4. Epilogue Stub Overview

Every RET instruction in the function is replaced with an epilogue stub. The epilogue performs the inverse of the prologue: it calls the handler to re-encrypt the function body, then executes the original return.

; Epilogue stub (replaces each RET instruction)
epilogue:
    ; === Save registers (same set as prologue) ===
    push    rbx
    push    rax
    push    rcx
    push    rdx
    push    r8
    push    r9
    push    r10
    push    r11
    pushfq

    ; === PIC: Get our address ===
    call    .Lepinext
.Lepinext:
    pop     rbx

    ; === Set up handler arguments ===
    lea     rcx, [rbx - EPILOGUE_TO_BODY_OFFSET]  ; Pointer to function body
    mov     rdx, 1              ; 1 = encrypt (re-mask)

    ; === Call handler ===
    call    handler_address

    ; === Restore registers ===
    popfq
    pop     r11
    pop     r10
    pop     r9
    pop     r8
    pop     rdx
    pop     rcx
    pop     rax
    pop     rbx

    ; === Execute the original return ===
    ret

The critical difference from the prologue is the operation flag: RDX = 1 tells the handler to encrypt rather than decrypt.
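
One way to sanity-check a listing like this is to track the stack symbolically. The sketch below is a hypothetical checker, not part of X86RetModPass: `EPILOGUE_OPS` transcribes the stack-affecting instructions of the epilogue listing, the `"check"` marker stands for the point where `call handler_address` executes, and `simulate` is an illustrative helper. It assumes every push, pop, and return address is 8 bytes and that RSP is 8 mod 16 when the stub starts, which follows from the Microsoft x64 rule that RSP is 16-byte aligned just before a CALL.

```python
# Symbolic stack model for the epilogue listing; no machine code is executed.
EPILOGUE_OPS = [
    ("push", "rbx"), ("push", "rax"), ("push", "rcx"), ("push", "rdx"),
    ("push", "r8"), ("push", "r9"), ("push", "r10"), ("push", "r11"),
    ("push", "rflags"),
    ("call", ".Lepinext"),      # pushes a return address
    ("pop", "rbx"),             # consumes that return address
    ("check", "handler call"),  # point where `call handler_address` executes
    ("pop", "rflags"), ("pop", "r11"), ("pop", "r10"), ("pop", "r9"),
    ("pop", "r8"), ("pop", "rdx"), ("pop", "rcx"), ("pop", "rax"),
    ("pop", "rbx"),
]

def simulate(ops, entry_rsp_mod16=8):
    """Return (net bytes left on the stack, RSP mod 16 at each 'check')."""
    depth = 0        # bytes currently pushed by the stub
    alignment = []
    for op, _operand in ops:
        if op in ("push", "call"):
            depth += 8
        elif op == "pop":
            depth -= 8
        elif op == "check":
            alignment.append((entry_rsp_mod16 - depth) % 16)
    return depth, alignment

net, at_call = simulate(EPILOGUE_OPS)
print(net, at_call)  # 0 [0]
```

A net of 0 means the final RET finds the caller's return address on top of the stack, and an alignment of 0 means the handler is entered with a 16-byte-aligned RSP. The model says nothing about the 32 bytes of shadow space the Microsoft x64 ABI reserves for a callee, which the simplified listing does not allocate.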

5. Return Value Preservation

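When a function returns a value, it is placed in RAX (integers/pointers) or XMM0 (floating point) per the Microsoft x64 ABI. The epilogue stub must preserve this return value. The listing in Section 4 saves RAX but not XMM0, so a floating-point return value survives only if the handler leaves XMM0 untouched: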

RAX Preservation

Notice in the epilogue that RAX is pushed before the handler call and popped after. This ensures the return value survives the re-encryption process. The handler itself does not return a meaningful value in RAX (or if it does, it’s overwritten by the pop), so the function’s return value is correctly passed to the caller.

// Example: function returns an int
__attribute__((annotate("peekaboo")))
int compute_hash(const char* data) {
    int hash = 0;
    // ... compute hash ...
    return hash;  // hash is in RAX when RET executes
}

// After instrumentation:
// 1. Prologue decrypts function body
// 2. compute_hash runs, puts result in RAX
// 3. Before RET: epilogue saves RAX, re-encrypts body, restores RAX
// 4. RET: caller receives correct hash value in RAX
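
The save/restore bracket can be illustrated with a toy register file. This is a sketch under stated assumptions: `SAVED` mirrors the push order of the epilogue listing, while `fake_handler` is a hypothetical stand-in that deliberately clobbers every volatile register, including XMM0. The real handler's register usage is not specified in this module.

```python
# Toy register-file model of the epilogue's save/restore; purely illustrative.
SAVED = ["rbx", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rflags"]

def fake_handler(regs):
    """Hypothetical worst case: overwrite every volatile register."""
    for name in ("rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rflags", "xmm0"):
        regs[name] = 0xDEAD

def run_epilogue(regs):
    stack = [regs[name] for name in SAVED]   # pushes, in listing order
    fake_handler(regs)
    for name in reversed(SAVED):             # pops, in reverse order
        regs[name] = stack.pop()
    return regs

regs = {name: 0 for name in SAVED}
regs["rax"] = 0x1234    # integer return value, e.g. from compute_hash
regs["xmm0"] = 2.5      # where a floating-point return value would live
after = run_epilogue(dict(regs))
print(hex(after["rax"]), after["xmm0"])  # 0x1234 57005
```

RAX comes back intact because it is inside the saved set. XMM0 does not in this worst-case model, which is the situation a stub would have to handle if protected functions return floating-point values and the handler touches XMM0.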

6. Stack Frame Interaction

The prologue stub runs before the compiler-generated stack frame setup (the standard push rbp; mov rbp, rsp; sub rsp, N sequence). The epilogue stub runs after the compiler-generated frame teardown but before the actual RET:

Execution order:

CALL from caller
  |
  v
[Prologue Stub]          ← Decrypt function
  |
  v
[Compiler Frame Setup]   ← push rbp; mov rbp, rsp; sub rsp, ...
  |
  v
[Function Body]          ← Original code
  |
  v
[Compiler Frame Teardown] ← add rsp, N; pop rbp
  |
  v
[Epilogue Stub]          ← Re-encrypt function
  |
  v
[RET]                    ← Return to caller

This ordering is important because the prologue stub’s register pushes are balanced by its own pops before the compiler’s frame setup. The stack is in the exact state the compiler expects when the frame setup code runs.

7. Recursive and Reentrant Calls

What happens if a peekaboo function calls another peekaboo function? Or calls itself recursively? The handler must handle these cases correctly:

Reentrancy Scenario

func_A (peekaboo) calls func_B (peekaboo):

1. func_A prologue: decrypt func_A body
2. func_A executes, calls func_B
3. func_B prologue: decrypt func_B body
   (func_A is still decrypted - it's on the call stack)
4. func_B executes
5. func_B epilogue: re-encrypt func_B body
6. Return to func_A (still decrypted, continues executing)
7. func_A epilogue: re-encrypt func_A body

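The handler uses the TEB UserReserved fields (discussed in Module 6) to track active functions. When func_B’s prologue runs, func_A is already decrypted; the handler keeps the nesting straight by maintaining a reference counter or a stack of active functions. Direct recursion needs the same bookkeeping: because XOR is its own inverse, decrypting an already-decrypted body a second time would re-encrypt it while it is still executing.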
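
The nesting rule can be modeled without any real encryption. The sketch below is an assumption about how such bookkeeping could work, not the handler's actual data structure: `ToyHandler`, `enter`, and `leave` are hypothetical names, and a per-function counter stands in for whatever the TEB fields really hold.

```python
# Toy model of nesting bookkeeping; no memory is modified.
from collections import Counter

class ToyHandler:
    def __init__(self):
        self.depth = Counter()   # active frames per function
        self.plain = set()       # functions currently unmasked

    def enter(self, name):       # prologue call (flag 0)
        if self.depth[name] == 0:
            self.plain.add(name)     # unmask only on the outermost entry
        self.depth[name] += 1

    def leave(self, name):       # epilogue call (flag 1)
        self.depth[name] -= 1
        if self.depth[name] == 0:
            self.plain.discard(name) # re-mask only when the last frame returns

h = ToyHandler()
h.enter("func_A"); h.enter("func_A"); h.enter("func_B")  # A recurses, then calls B
print(sorted(h.plain))  # ['func_A', 'func_B']
h.leave("func_B"); h.leave("func_A")
print(sorted(h.plain))  # ['func_A']
h.leave("func_A")
print(sorted(h.plain))  # []
```

The inner return from func_A leaves it unmasked because an outer frame is still active; only the outermost return masks it again.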

The Two-Decrypted Problem

In the scenario above, between steps 3 and 5, both func_A and func_B are decrypted simultaneously. This is unavoidable: func_A is still on the call stack, and its code must be executable again when func_B returns. The “at most one function decrypted” guarantee is therefore really “at most one function decrypted per active peekaboo frame on the call stack.” In practice, the call depth is small relative to the total number of registered functions, so ~98% coverage still holds.

8. Stub Size and Performance Impact

Component	Size	Performance Impact
Prologue stub	~0x46 bytes (70 bytes)	~50-100ns per function call (register saves + handler call)
Epilogue stub	~0x40 bytes (64 bytes) per return point	~50-100ns per function return
Handler execution	N/A (shared code)	~1-10μs per invocation (XOR loop + VirtualProtect calls)
Total per call	—	~2-20μs (prologue + epilogue + two handler calls)

The overhead is dominated by the VirtualProtect system calls within the handler (Module 6). The XOR operation itself is extremely fast. For functions called infrequently (C2 check-in, command dispatch, credential handling), this overhead is negligible. For hot-loop functions called millions of times, it would be significant — which is why the attribute-based opt-in approach lets developers exclude performance-critical functions.
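
The totals in the table follow from simple arithmetic, which also shows how quickly the cost grows with call count. The figures below are the table's own estimated ranges, not measurements, and `per_call_ns` and `total_seconds` are illustrative helper names.

```python
# Per-call overhead from the table's ranges, in nanoseconds.
STUB_NS = (50, 100)            # one stub (prologue or epilogue)
HANDLER_NS = (1_000, 10_000)   # one handler invocation (1-10 microseconds)

def per_call_ns():
    """One call = prologue + epilogue + two handler invocations."""
    low = 2 * STUB_NS[0] + 2 * HANDLER_NS[0]
    high = 2 * STUB_NS[1] + 2 * HANDLER_NS[1]
    return low, high

def total_seconds(calls: int):
    low, high = per_call_ns()
    return calls * low / 1e9, calls * high / 1e9

print(per_call_ns())             # (2100, 20200)
print(total_seconds(1_000_000))  # (2.1, 20.2)
```

At these estimates a function called a handful of times costs microseconds in total, while one called a million times in a loop would add roughly 2 to 20 seconds.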

9. Position Independence Verification

The stubs must work correctly regardless of ASLR (Address Space Layout Randomization). To verify position independence, examine every memory reference in the stubs:

PIC Checklist

No absolute addresses: all addresses are derived from CALL/POP at runtime
No relocations needed: the stub bytes are the same regardless of load address
Handler address: resolved via PE header traversal or TEB field, not hard-coded
Function body offset: a compile-time constant (0x46 from the start of the stub; BODY_OFFSET in the listing is measured from the popped address, so it is smaller by the bytes that precede the POP target)
Stack references: all via RSP-relative addressing (inherently position-independent)

Knowledge Check

Q1: What is the purpose of the CALL/POP trick in the prologue stub?

A) To call the handler function directly

B) To obtain the stub's own runtime address for position-independent addressing

C) To pop the return address and prevent stack traces

D) To encrypt the function body inline

Q2: What is the key difference between the prologue and epilogue handler calls?

A) The epilogue uses a different handler function

B) The prologue saves more registers than the epilogue

C) The epilogue does not need PIC addressing

D) The prologue passes flag 0 (decrypt), the epilogue passes flag 1 (encrypt)

Q3: In a call chain where peekaboo func_A calls peekaboo func_B, how many functions are decrypted simultaneously?

A) Zero — the handler prevents this

B) One — func_A is re-encrypted when func_B starts

C) Two — both must be decrypted because func_A is still on the call stack

D) All registered functions are decrypted when any function runs
