Module 5: Prologue & Epilogue Stubs
The injected code at function boundaries that enables transparent decrypt-on-call and re-encrypt-on-return.
Module Objective
Understand the exact structure of the prologue and epilogue stubs injected by X86RetModPass: the 0x46-byte prologue with its CALL/POP PIC trick, register preservation, handler invocation, and the epilogue stub that replaces each RET instruction. Learn how these stubs achieve position independence and transparent operation.
1. Prologue Stub Overview
The prologue stub is a 0x46-byte (70-byte) sequence prepended to every registered function. It runs before the function’s original code and performs one critical task: calling the handler to decrypt the function body.
Prologue Stub Layout
Function Start (as seen by callers)
+-----------------------------------------------+
| Bytes 0x00-0x05: CALL/POP PIC trick | ← Get current RIP
| Bytes 0x06-0x20: Save registers (push) | ← Preserve all volatile regs
| Bytes 0x21-0x30: Save flags (pushfq) |
| Bytes 0x31-0x3A: Set up handler args | ← Pass "decrypt" flag + function ptr
| Bytes 0x3B-0x3F: CALL handler | ← Invoke the shared handler
| Bytes 0x40-0x44: Restore flags + registers | ← Restore original state
| Byte 0x45: Fall through | ← Continue to decrypted function body
+-----------------------------------------------+
| Original Function Body (now decrypted) |
+-----------------------------------------------+
The key challenge is that the stub must work regardless of where the function is loaded in memory (ASLR). This is solved with the CALL/POP trick for position-independent code (PIC).
2. The CALL/POP PIC Trick
Position-independent code needs to know its own address at runtime. A classic x86 technique, which also works on x86-64, is:
; CALL/POP trick for position-independent addressing
prologue_start:
call next_instruction ; CALL pushes the return address (address of this CALL + 5) onto the stack
next_instruction:
pop rbx ; POP retrieves the address of 'next_instruction'
; Now RBX = address of 'next_instruction' at runtime
; We can calculate any relative offset from here
The CALL instruction pushes the address of the next instruction onto the stack. The immediately following POP retrieves it into a register. Now the stub knows its own address and can calculate relative offsets to:
- The function body start (prologue start + 0x46)
- The handler function (known offset from the stub)
- The .funcmeta section (via PE header walking)
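The offset arithmetic behind the first bullet can be sketched as follows. This is a minimal illustration, not the pass's actual code: it assumes the CALL uses the 5-byte `E8 rel32` encoding and takes the 0x46 stub size from this module. The function names and the `bytes_before_call` parameter are hypothetical and exist only to show how the displacement changes if any instruction precedes the CALL.

```python
CALL_REL32_SIZE = 5    # E8 opcode + 32-bit displacement
PROLOGUE_SIZE = 0x46   # stub size stated in this module


def label_offset(bytes_before_call: int = 0) -> int:
    """Offset of the popped address (the CALL's return address) from the stub start."""
    return bytes_before_call + CALL_REL32_SIZE


def body_offset(bytes_before_call: int = 0, stub_size: int = PROLOGUE_SIZE) -> int:
    """Displacement to add to the popped address to reach the function body."""
    return stub_size - label_offset(bytes_before_call)


def body_address(popped_address: int, bytes_before_call: int = 0) -> int:
    """Runtime address of the function body, derived from the popped address."""
    return popped_address + body_offset(bytes_before_call)


if __name__ == "__main__":
    print(hex(body_offset()))    # CALL is the first instruction of the stub
    print(hex(body_offset(1)))   # a 1-byte PUSH precedes the CALL
```

With the CALL at byte 0 of the stub, the popped address is stub start + 5, so the body lies 0x41 bytes past it. Every instruction placed before the CALL shrinks that displacement by its own length.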
Why Not LEA with RIP-Relative?
x86-64 supports LEA RAX, [RIP + offset] which is also position-independent. However, at the time the stub is emitted (PreEmit phase), the exact distance to the handler may not be finalized (the linker hasn’t placed sections yet). The CALL/POP trick works without knowing the absolute distance at compile time — the handler address can be resolved through PE header traversal at runtime.
3. Register Preservation in the Prologue
The prologue must save and restore ALL registers it touches. Under the Microsoft x64 ABI, the function’s callers may have placed arguments in RCX, RDX, R8, R9, and on the stack. The prologue must not disturb any of these:
; Complete prologue stub (simplified)
prologue:
; === Save RBX, then PIC: get our own address ===
push rbx ; RBX is overwritten below, so save it first (balances the final pop rbx)
call .Lnext
.Lnext:
pop rbx ; RBX = runtime address of .Lnext
; === Save ALL volatile registers ===
push rax
push rcx ; Arg 1 (MS x64 ABI)
push rdx ; Arg 2
push r8 ; Arg 3
push r9 ; Arg 4
push r10
push r11
pushfq ; Save CPU flags (CF, ZF, SF, OF, etc.)
; === Set up handler arguments ===
; RCX = pointer to this function's body (RBX + offset_to_body)
lea rcx, [rbx + BODY_OFFSET]
; RDX = operation flag: 0 = decrypt, 1 = encrypt
xor rdx, rdx ; 0 = decrypt
; === Call the handler ===
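; Handler address is resolved via TEB or embedded offset
; (Simplified: the shadow space and 16-byte stack alignment that the
; MS x64 ABI requires around a call are not shown)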
call handler_address
; === Restore ALL registers ===
popfq
pop r11
pop r10
pop r9
pop r8
pop rdx
pop rcx
pop rax
pop rbx ; Restore RBX used by CALL/POP
; === Fall through to function body (now decrypted) ===
function_body_start:
; Original function instructions begin here
RBX Is Non-Volatile
Under the Microsoft x64 calling convention, RBX is a non-volatile (callee-saved) register. The prologue saves and restores it explicitly because the CALL/POP trick uses it. The function body (generated by the compiler) also saves/restores RBX if it uses it, so there’s no conflict — the prologue’s push/pop of RBX happens before the compiler’s own frame setup.
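One way to sanity-check a stub like this is to trace its pushes and pops in a toy model. The sketch below is only such a model, not part of the toolchain. It assumes RBX is pushed before the CALL/POP pair (as the note above describes), treats pushfq/popfq as a push/pop of a pseudo-register named `rflags`, and pessimistically assumes the handler clobbers every volatile register. All names are hypothetical.

```python
VOLATILE = ("rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rflags")

PROLOGUE = (
    [("push", "rbx"), ("call_pop", "rbx")]
    + [("push", r) for r in VOLATILE]
    + [("handler", None)]
    + [("pop", r) for r in reversed(VOLATILE)]
    + [("pop", "rbx")]
)


def run(ops, regs):
    """Replay a push/pop trace; return (final registers, net stack depth change)."""
    regs = dict(regs)
    stack = ["caller-return"]          # what the caller's CALL left on the stack
    for op, reg in ops:
        if op == "push":
            stack.append(regs[reg])
        elif op == "pop":
            regs[reg] = stack.pop()
        elif op == "call_pop":         # CALL pushes an address, POP takes it back
            stack.append("return-address")
            regs[reg] = stack.pop()
        elif op == "handler":          # assumed to trash all volatile registers
            for r in VOLATILE:
                regs[r] = "clobbered"
    return regs, len(stack) - 1


if __name__ == "__main__":
    before = {r: "orig-" + r for r in VOLATILE + ("rbx",)}
    after, delta = run(PROLOGUE, before)
    print(after == before, delta)
```

Dropping the leading push (`PROLOGUE[1:]`) makes the final pop consume the caller's return address, leaving the stack one slot short. A trace like this makes that kind of imbalance easy to spot.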
4. Epilogue Stub Overview
Every RET instruction in the function is replaced with an epilogue stub. The epilogue performs the inverse of the prologue: it calls the handler to re-encrypt the function body, then executes the original return.
; Epilogue stub (replaces each RET instruction)
epilogue:
; === Save registers (same set as prologue) ===
push rbx
push rax
push rcx
push rdx
push r8
push r9
push r10
push r11
pushfq
; === PIC: Get our address ===
call .Lepinext
.Lepinext:
pop rbx
; === Set up handler arguments ===
lea rcx, [rbx - EPILOGUE_TO_BODY_OFFSET] ; Pointer to function body
mov rdx, 1 ; 1 = encrypt (re-mask)
; === Call handler ===
call handler_address
; === Restore registers ===
popfq
pop r11
pop r10
pop r9
pop r8
pop rdx
pop rcx
pop rax
pop rbx
; === Execute the original return ===
ret
The critical difference from the prologue is the operation flag: RDX = 1 tells the handler to encrypt rather than decrypt.
5. Return Value Preservation
When a function returns a value, it is placed in RAX (integers/pointers) or XMM0 (floating point) per the Microsoft x64 ABI. The epilogue stub must preserve this return value:
RAX Preservation
Notice in the epilogue that RAX is pushed before the handler call and popped after. This ensures the return value survives the re-encryption process. Whatever the handler leaves in RAX is overwritten by the pop, so the function’s return value reaches the caller intact. Note that the listing saves only general-purpose registers: a floating-point return value in XMM0 survives only if the handler does not modify XMM0, or if the stub saves it as well.
// Example: function returns an int
__attribute__((annotate("peekaboo")))
int compute_hash(const char* data) {
int hash = 0;
// ... compute hash ...
return hash; // hash is in RAX when RET executes
}
// After instrumentation:
// 1. Prologue decrypts function body
// 2. compute_hash runs, puts result in RAX
// 3. Before RET: epilogue saves RAX, re-encrypts body, restores RAX
// 4. RET: caller receives correct hash value in RAX
6. Stack Frame Interaction
The prologue stub runs before the compiler-generated stack frame setup (the standard push rbp; mov rbp, rsp; sub rsp, N sequence). The epilogue stub runs after the compiler-generated frame teardown but before the actual RET:
Execution order:
CALL from caller
|
v
[Prologue Stub] ← Decrypt function
|
v
[Compiler Frame Setup] ← push rbp; mov rbp, rsp; sub rsp, ...
|
v
[Function Body] ← Original code
|
v
[Compiler Frame Teardown] ← add rsp, N; pop rbp
|
v
[Epilogue Stub] ← Re-encrypt function
|
v
[RET] ← Return to caller
This ordering is important because the prologue stub’s register pushes are balanced by its own pops before the compiler’s frame setup. The stack is in the exact state the compiler expects when the frame setup code runs.
7. Recursive and Reentrant Calls
What happens if a peekaboo function calls another peekaboo function? Or calls itself recursively? The handler must handle these cases correctly:
Reentrancy Scenario
func_A (peekaboo) calls func_B (peekaboo):
1. func_A prologue: decrypt func_A body
2. func_A executes, calls func_B
3. func_B prologue: decrypt func_B body
(func_A is still decrypted - it's on the call stack)
4. func_B executes
5. func_B epilogue: re-encrypt func_B body
6. Return to func_A (still decrypted, continues executing)
7. func_A epilogue: re-encrypt func_A body
The handler uses the TEB UserReserved fields (discussed in Module 6) to track the currently active function. When func_B’s prologue runs, it sees that func_A is already decrypted and handles the nesting correctly by maintaining a reference counter or stack of active functions.
The Two-Decrypted Problem
In the scenario above, between steps 3 and 5, both func_A and func_B are decrypted simultaneously. This is unavoidable: func_A’s code is on the call stack and will be needed when func_B returns. The “at most one function decrypted” guarantee is therefore really “at most one function decrypted per level of the active call chain.” In practice, the call depth is small relative to the total number of registered functions, so ~98% coverage still holds.
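The nesting rules can be illustrated with a small bookkeeping model. This is a sketch of the reference-counter idea only. It does not claim to reflect how the real handler stores its state in the TEB, and the class and method names are hypothetical.

```python
class ActiveFunctionTracker:
    """Toy model: per-function counts decide when a body is decrypted or re-encrypted."""

    def __init__(self):
        self.counts = {}   # function name -> number of active invocations
        self.log = []      # decrypt/encrypt events in order
        self.peak = 0      # most functions decrypted at the same time

    def enter(self, name):
        """Prologue: decrypt only when the first active invocation starts."""
        self.counts[name] = self.counts.get(name, 0) + 1
        if self.counts[name] == 1:
            self.log.append(("decrypt", name))
        self.peak = max(self.peak, len(self.counts))

    def leave(self, name):
        """Epilogue: re-encrypt only when the last active invocation returns."""
        self.counts[name] -= 1
        if self.counts[name] == 0:
            del self.counts[name]
            self.log.append(("encrypt", name))

    def decrypted(self):
        """Names of the functions whose bodies are currently decrypted."""
        return sorted(self.counts)


if __name__ == "__main__":
    t = ActiveFunctionTracker()
    t.enter("func_A"); t.enter("func_B")
    print(t.decrypted())
    t.leave("func_B"); t.leave("func_A")
    print(t.log, t.peak)
```

In this model a recursive call only bumps the count, so the body is not decrypted a second time and is not re-encrypted while an outer invocation is still running. The `peak` value equals the number of distinct registered functions on the call chain, which is 2 for func_A calling func_B.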
8. Stub Size and Performance Impact
| Component | Size | Performance Impact |
|---|---|---|
| Prologue stub | ~0x46 bytes (70 bytes) | ~50-100ns per function call (register saves + handler call) |
| Epilogue stub | ~0x40 bytes (64 bytes) per return point | ~50-100ns per function return |
| Handler execution | N/A (shared code) | ~1-10μs per invocation (XOR loop + VirtualProtect calls) |
| Total per call | — | ~2-20μs (prologue + epilogue + two handler calls) |
The overhead is dominated by the VirtualProtect system calls within the handler (Module 6). The XOR operation itself is extremely fast. For functions called infrequently (C2 check-in, command dispatch, credential handling), this overhead is negligible. For hot-loop functions called millions of times, it would be significant — which is why the attribute-based opt-in approach lets developers exclude performance-critical functions.
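The "Total per call" row can be reproduced from the other rows with simple arithmetic. The sketch below uses only the ranges from the table. The function names and the calls-per-second scenario are illustrative assumptions, not measurements.

```python
NS = 1e-9
US = 1e-6

# (low, high) estimates taken from the table above, in seconds
PROLOGUE = (50 * NS, 100 * NS)
EPILOGUE = (50 * NS, 100 * NS)
HANDLER = (1 * US, 10 * US)


def per_call_overhead():
    """(low, high) seconds added to one call: both stubs plus two handler runs."""
    low = PROLOGUE[0] + EPILOGUE[0] + 2 * HANDLER[0]
    high = PROLOGUE[1] + EPILOGUE[1] + 2 * HANDLER[1]
    return low, high


def overhead_per_second(calls_per_second):
    """(low, high) seconds of overhead accumulated per second of wall time."""
    low, high = per_call_overhead()
    return low * calls_per_second, high * calls_per_second


if __name__ == "__main__":
    print(per_call_overhead())
    print(overhead_per_second(10))
    print(overhead_per_second(1_000_000))
```

The per-call result is about 2.1 to 20.2 μs, consistent with the ~2-20μs row. At ten calls per second the cost stays well under a millisecond per second. At a million calls per second the overhead alone exceeds one CPU-second per second, which is the hot-loop case described above.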
9. Position Independence Verification
The stubs must work correctly regardless of ASLR (Address Space Layout Randomization). To verify position independence, examine every memory reference in the stubs:
PIC Checklist
- No absolute addresses — all addresses derived from CALL/POP at runtime
- No relocations needed — the stub bytes are the same regardless of load address
- Handler address — resolved via PE header traversal or TEB field, not hard-coded
- Function body offset — known at compile time as a relative offset from the stub (constant 0x46)
- Stack references — all via RSP-relative addressing (inherently position-independent)
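The first two checklist items can be illustrated with a toy emulation of just the CALL/POP pair. This is a sketch under stated assumptions: it models only the standard encodings `E8 rel32` (near call) and `5B` (pop rbx), with a displacement of zero so the CALL targets the instruction that follows it. It does not emulate the rest of the stub, and the function name is hypothetical.

```python
STUB_HEAD = bytes([0xE8, 0x00, 0x00, 0x00, 0x00,   # call rel32, displacement 0
                   0x5B])                           # pop rbx


def run_call_pop(code: bytes, load_base: int) -> int:
    """Emulate `call +0; pop rbx` loaded at load_base and return the value of RBX."""
    if code[0] != 0xE8 or code[5] != 0x5B:
        raise ValueError("expected call rel32 followed by pop rbx")
    rel32 = int.from_bytes(code[1:5], "little", signed=True)
    return_address = load_base + 5        # pushed by CALL
    target = return_address + rel32       # where CALL transfers control
    if target != load_base + 5:
        raise ValueError("CALL does not land on the POP")
    return return_address                 # POP RBX reads the pushed value back


if __name__ == "__main__":
    for base in (0x140001000, 0x7FF612340000):
        rbx = run_call_pop(STUB_HEAD, base)
        print(hex(base), hex(rbx - base))
```

The same six bytes yield a different RBX at each load address, but RBX minus the load base is always 5. The bytes therefore need no relocation, and every other address can be derived from RBX with a constant displacement.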
Knowledge Check
Q1: What is the purpose of the CALL/POP trick in the prologue stub?
Q2: What is the key difference between the prologue and epilogue handler calls?
Q3: In a call chain where peekaboo func_A calls peekaboo func_B, how many functions are decrypted simultaneously?