Difficulty: Intermediate

Module 4: Stack Desynchronization Theory

Separating what the CPU actually does from what the stack says happened.

The Core Innovation

SilentMoonwalk's key insight is stack desynchronization: the physical stack layout seen by a stack walker does NOT have to reflect the actual execution path. By carefully constructing the stack, you can make the unwinder see call chain A → B → C → D while the actual execution path was completely different. This module explains the theory behind this separation.

Synchronized vs Desynchronized Stacks

In normal execution, the stack is synchronized — the return addresses on the stack directly correspond to the actual call chain. If main() calls foo() which calls bar(), the stack unwinder will see exactly that chain.

A desynchronized stack breaks this correspondence. The execution might follow path X → Y → Z, but the stack is constructed so the unwinder sees A → B → C. The two views are decoupled:

Synchronized (normal)
    Execution:   main → foo → bar
    Stack shows: main → foo → bar
    Match:       yes

Desynchronized (SilentMoonwalk)
    Execution:   shellcode → ROP → syscall
    Stack shows: RtlUserThreadStart → BaseThreadInitThunk → kernel32!SleepEx → ...
    Match:       no (by design)

Why Desynchronization Is Possible

Stack desynchronization is possible because of a fundamental property of the x64 unwinding mechanism: the unwinder is stateless. It doesn't track what functions were actually called. It only looks at:

The current RIP (instruction pointer)
The current RSP (stack pointer)
The RUNTIME_FUNCTION / UNWIND_INFO for the function containing RIP
The values at specific stack offsets (as dictated by the unwind codes)

If you arrange these four things to be internally consistent, the unwinder will happily produce a clean call chain — regardless of whether those functions were ever actually called.
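
As a minimal sketch of this statelessness, the following toy model (not the real RtlVirtualUnwind) walks an array of 8-byte slots using only a starting RIP, a starting RSP, and a lookup table. The names ToyFunction, kFunctions, LookupFunction, WalkToyStack, and DemoWalk, along with every address and frame size, are hypothetical; the table merely stands in for RUNTIME_FUNCTION / UNWIND_INFO lookups.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for RUNTIME_FUNCTION + UNWIND_INFO: a code range and
// the number of bytes between RSP and the return address inside that range.
struct ToyFunction {
    std::uint64_t begin;      // first address of the function
    std::uint64_t end;        // one past the last address
    std::uint64_t frameSize;  // bytes allocated by the prologue
};

constexpr ToyFunction kFunctions[] = {
    {0x1000, 0x1100, 0x10},  // entry 0, "A"
    {0x2000, 0x2100, 0x20},  // entry 1, "B"
    {0x3000, 0x3100, 0x08},  // entry 2, "C"
};

// Returns the table index of the function containing rip, or -1 if none.
constexpr int LookupFunction(std::uint64_t rip) {
    int index = 0;
    for (const ToyFunction& f : kFunctions) {
        if (rip >= f.begin && rip < f.end) return index;
        ++index;
    }
    return -1;
}

struct WalkResult {
    int count;   // number of frames reported
    int ids[8];  // table index of each reported frame, -1 if unused
};

// Walks the toy stack. rsp is a byte offset into the array. The only inputs
// are rip, rsp, the table, and the stack contents: no call history exists.
template <std::size_t N>
constexpr WalkResult WalkToyStack(const std::uint64_t (&stack)[N],
                                  std::uint64_t rip, std::uint64_t rsp) {
    WalkResult result = {0, {-1, -1, -1, -1, -1, -1, -1, -1}};
    while (result.count < 8) {
        const int id = LookupFunction(rip);
        if (id < 0) break;  // rip is not inside any known function
        result.ids[result.count++] = id;
        const std::uint64_t retOffset = rsp + kFunctions[id].frameSize;
        if (retOffset / 8 >= N) break;  // ran off the end of the toy stack
        rip = stack[retOffset / 8];     // "return address" for this frame
        rsp = retOffset + 8;            // caller's RSP after the return
    }
    return result;
}

// Fills an array by hand and walks it; none of the "functions" ever ran.
constexpr WalkResult DemoWalk() {
    const std::uint64_t stack[10] = {
        0, 0,        // "A" frame body (0x10 bytes)
        0x2040,      // return address into "B"
        0, 0, 0, 0,  // "B" frame body (0x20 bytes)
        0x3008,      // return address into "C"
        0,           // "C" frame body (0x08 bytes)
        0,           // zero return address: lookup fails and the walk ends
    };
    return WalkToyStack(stack, 0x1010, 0);
}
```

DemoWalk() reports three frames (table entries 0, 1, and 2) even though nothing in the program ever called them; the chain comes entirely from the values placed in the array.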

The Unwinder's Trust Model

RtlVirtualUnwind trusts that the stack has not been tampered with. It assumes that if RIP is inside function F at offset X, then every prologue operation recorded at or before offset X has already executed, and the stack holds F's saved registers and allocations at the offsets described by F's UNWIND_INFO. SilentMoonwalk exploits this trust by placing values at exactly the right offsets.

The SilentMoonwalk Approach

SilentMoonwalk achieves desynchronization through a multi-step process. At a high level:

Step 1: Select Target Functions

Choose a set of legitimate functions from system modules such as ntdll.dll, KERNELBASE.dll, or kernel32.dll whose call chain would be plausible. For example, a waiting thread might show a trace shaped like this (the offsets are illustrative):

ntdll!NtWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx+0x8e
kernel32!SleepEx+0x63
kernel32!Sleep+0x9
SomeApp!WorkerThread+0x42
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Step 2: Compute Frame Sizes

For each target function, parse its RUNTIME_FUNCTION and UNWIND_INFO to determine the exact frame size: the total stack space allocated by the function's prologue (pushed non-volatile registers plus the SUB RSP allocation). On the spoofed stack, each frame must occupy exactly this many bytes, followed by the 8-byte return-address slot.

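// Computing frame size from UNWIND_INFO
// This determines how much stack space a function occupies.
// UNWIND_INFO, UNWIND_CODE, and UBYTE are assumed to be declared as in
// Microsoft's x64 exception-handling documentation.
DWORD ComputeFrameSize(PUNWIND_INFO pUnwind) {
    DWORD frameSize = 0;

    for (UBYTE i = 0; i < pUnwind->CountOfCodes; i++) {
        UNWIND_CODE code = pUnwind->UnwindCode[i];
        switch (code.UnwindOp) {
            case UWOP_PUSH_NONVOL:      // 0
                frameSize += 8;          // Each push adds 8 bytes
                break;
            case UWOP_ALLOC_LARGE:      // 1
                if (code.OpInfo == 0) {
                    // Next slot holds the allocation size divided by 8
                    frameSize += pUnwind->UnwindCode[++i].FrameOffset * 8;
                } else {
                    // Next two slots hold the unscaled 32-bit size
                    DWORD size = *(DWORD*)&pUnwind->UnwindCode[i + 1];
                    frameSize += size;
                    i += 2;
                }
                break;
            case UWOP_ALLOC_SMALL:      // 2
                frameSize += (code.OpInfo * 8) + 8;
                break;
            case UWOP_SAVE_NONVOL:      // 4
            case UWOP_SAVE_XMM128:      // 8
                i += 1;                  // Skip one operand slot; RSP unchanged
                break;
            case UWOP_SAVE_NONVOL_FAR:  // 5
            case UWOP_SAVE_XMM128_FAR:  // 9
                i += 2;                  // Skip two operand slots; RSP unchanged
                break;
            default:
                // UWOP_SET_FPREG (3) takes one slot and allocates nothing, but
                // the real unwinder then derives RSP from the frame register,
                // so this static size may not apply to such functions.
                // UWOP_PUSH_MACHFRAME (10) is not handled in this sketch.
                break;
        }
    }

    // Chained unwind info (UNW_FLAG_CHAININFO) is not followed here.
    return frameSize;  // Total bytes between RSP and return address
}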

Step 3: Construct the Fake Stack

Lay out the spoofed frames contiguously on the stack. Each frame has the exact size computed from the target function's unwind codes. At the high-address end of each frame, immediately after its last byte (the point the unwinder reaches once it has undone the prologue), place a return address pointing into the next target function in the chain.

Synthetic Stack Layout

Low address (RSP points here)

Frame 0 — "inside" NtWaitForSingleObject
Size: matches UNWIND_INFO of NtWaitForSingleObject
Return addr at computed offset → points into WaitForSingleObjectEx

Frame 1 — "inside" WaitForSingleObjectEx
Size: matches UNWIND_INFO of WaitForSingleObjectEx
Return addr → points into SleepEx

Frame 2 — "inside" SleepEx
Size: matches UNWIND_INFO of SleepEx
Return addr → points into BaseThreadInitThunk

Frame 3 — "inside" BaseThreadInitThunk
Size: matches UNWIND_INFO
Return addr → points into RtlUserThreadStart

Frame 4 — RtlUserThreadStart (terminal frame)
Unwinder stops here.

High address (stack bottom)

Step 4: Use ROP to Actually Execute

The tricky part: this same stack layout must also function as a ROP chain for actual execution. SilentMoonwalk places gadget addresses at specific positions within each frame so that during execution, RSP advances through the frames via ADD RSP, N; RET gadgets, while from the unwinder's perspective, those same positions look like legitimate frame contents.

The Frame Size Problem

The hardest constraint in stack desynchronization is frame size matching. Consider this scenario:

SleepEx's UNWIND_INFO says:
  UWOP_ALLOC_SMALL: 0x28 bytes (OpInfo=4, so (4*8)+8 = 40 = 0x28)
  UWOP_PUSH_NONVOL: RBX (adds 8 bytes)
  UWOP_PUSH_NONVOL: RSI (adds 8 bytes)

Total frame size: 0x28 + 8 + 8 = 0x38 bytes (56 bytes)
Plus 8 bytes for return address = 0x40 (64 bytes) total between caller RSP and callee RSP

This means in our synthetic stack, the "SleepEx frame" must occupy
exactly 0x40 bytes. Not 0x38, not 0x48. Exactly 0x40.
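
The arithmetic above can be checked with a short sketch. It assumes the unwind codes have already been decoded into (operation, OpInfo) pairs. DecodedCode, FrameBytes, SleepExExampleFrame, and SleepExExampleSpan are hypothetical names, and the three codes are the illustrative ones from the scenario, not values read from a real SleepEx.

```cpp
#include <cstddef>
#include <cstdint>

// Only the two operations used in the scenario are modelled here.
enum class Op { PushNonvol, AllocSmall };

struct DecodedCode {
    Op op;
    unsigned opInfo;  // register number for PushNonvol, size field for AllocSmall
};

// Sums the stack bytes described by a list of decoded unwind codes.
template <std::size_t N>
constexpr std::uint32_t FrameBytes(const DecodedCode (&codes)[N]) {
    std::uint32_t bytes = 0;
    for (const DecodedCode& c : codes) {
        if (c.op == Op::PushNonvol) {
            bytes += 8;                 // one 8-byte register push
        } else {
            bytes += c.opInfo * 8 + 8;  // UWOP_ALLOC_SMALL encoding
        }
    }
    return bytes;
}

// Frame size for the illustrative codes: ALLOC_SMALL(4), PUSH RBX, PUSH RSI.
constexpr std::uint32_t SleepExExampleFrame() {
    const DecodedCode codes[3] = {
        {Op::AllocSmall, 4},  // (4 * 8) + 8 = 0x28
        {Op::PushNonvol, 3},  // RBX is register number 3
        {Op::PushNonvol, 6},  // RSI is register number 6
    };
    return FrameBytes(codes);
}

// Frame size plus the 8-byte return-address slot.
constexpr std::uint32_t SleepExExampleSpan() {
    return SleepExExampleFrame() + 8;
}
```

SleepExExampleFrame() evaluates to 0x38 and SleepExExampleSpan() to 0x40, matching the totals worked out by hand.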

The Non-Volatile Register Constraint

It's not enough to just get the frame size right. If the unwind codes include UWOP_PUSH_NONVOL for RBX and RSI, the unwinder will read values from those stack positions and restore them into the register context. If those positions contain garbage or obviously wrong values (like 0xDEADBEEF), a sophisticated analyzer might flag the frame as synthetic. SilentMoonwalk must place plausible values for saved registers.
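
To see why those slots matter, here is a toy model of a single unwind step (again, not the real RtlVirtualUnwind). It replays the illustrative codes against an array of 8-byte slots and shows that whatever sits in the push slots ends up in the restored register context. ToyOp, ToyUnwindCode, ToyContext, UnwindOneFrame, DemoRestore, and the placeholder values are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

enum class ToyOp { PushNonvol, AllocSmall };

struct ToyUnwindCode {
    ToyOp op;
    unsigned info;  // register number for PushNonvol, size field for AllocSmall
};

struct ToyContext {
    std::uint64_t rip;
    std::uint64_t rsp;       // byte offset into the toy stack
    std::uint64_t regs[16];  // indexed by x64 register number
};

// Undoes one prologue. No bounds checks: the demo stack is sized to fit.
template <std::size_t N, std::size_t M>
constexpr ToyContext UnwindOneFrame(const std::uint64_t (&stack)[N],
                                    const ToyUnwindCode (&codes)[M],
                                    ToyContext ctx) {
    for (const ToyUnwindCode& c : codes) {
        if (c.op == ToyOp::AllocSmall) {
            ctx.rsp += c.info * 8 + 8;              // release the allocation
        } else {
            ctx.regs[c.info] = stack[ctx.rsp / 8];  // "restore" the register
            ctx.rsp += 8;
        }
    }
    ctx.rip = stack[ctx.rsp / 8];  // return address follows the frame
    ctx.rsp += 8;
    return ctx;
}

constexpr ToyContext DemoRestore() {
    const ToyUnwindCode codes[3] = {
        {ToyOp::AllocSmall, 4},  // 0x28 bytes
        {ToyOp::PushNonvol, 3},  // RBX
        {ToyOp::PushNonvol, 6},  // RSI
    };
    const std::uint64_t stack[8] = {
        0, 0, 0, 0, 0,  // 0x28 bytes of allocation
        0x1111,         // slot read back as RBX (placeholder value)
        0x2222,         // slot read back as RSI (placeholder value)
        0x7000,         // return address (placeholder value)
    };
    ToyContext start = {};
    return UnwindOneFrame(stack, codes, start);
}
```

After DemoRestore(), the context holds 0x1111 in RBX, 0x2222 in RSI, 0x7000 in RIP, and RSP has advanced by 0x40. The register values come straight from the stack slots, which is why their contents are visible to anything that inspects the unwound context.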

Execution Flow vs Unwind Flow

The key to understanding SilentMoonwalk is recognizing that execution and unwinding move through the same stack memory in the same direction, but are driven by different mechanisms and read different things from it:

Property	Execution (ROP chain)	Unwinding (RtlVirtualUnwind)
Direction	RSP increases (moves toward stack bottom)	RSP increases (reverses prologue)
Driven by	RET instructions popping addresses	UNWIND_CODE processing
What it reads	Gadget addresses from the stack	Return addresses + saved registers
When it happens	During actual code execution	When EDR/OS walks the stack
State tracking	CPU registers (RSP, RIP, etc.)	CONTEXT structure (simulated)

SilentMoonwalk's genius is making both traversals work simultaneously on the same physical memory layout.

Comparison with Prior Approaches

ThreadStackSpoofer (Gen 1)

ThreadStackSpoofer by mgeeky simply overwrites the return address with NULL or a legitimate address before sleeping, then restores it after waking. This is a single-frame manipulation — it only modifies one return address. The unwinder hits the modified frame and either stops (NULL) or tries to unwind the spoofed function (usually failing because frame sizes don't match).

CallStackSpoofingPOC (Gen 2)

CallStackSpoofingPOC by pard0p uses a single ROP gadget (typically ADD RSP, 0x?? ; RET) to bridge between the real and fake parts of the stack. It spoofs only one or two frames rather than the entire chain.

SilentMoonwalk (Gen 3)

SilentMoonwalk constructs the entire call chain — every frame from the current position all the way back to RtlUserThreadStart. Each frame has correct size, plausible saved register values, and a return address pointing to a real instruction inside the target function. This is what makes it "fully dynamic" — it generates the spoofed stack on the fly based on the actual UNWIND_INFO of target functions.

Dynamic vs Static Spoofing

Earlier tools used hardcoded offsets and specific function addresses, making them brittle across Windows versions. SilentMoonwalk dynamically parses the .pdata section at runtime, computing frame sizes from the actual UNWIND_INFO structures. This means it works across Windows versions without needing to update offsets — as long as the target functions exist and have valid unwind data.
