Difficulty: Intermediate

Module 4: Stack Desynchronization Theory

Separating what the CPU actually does from what the stack says happened.

The Core Innovation

SilentMoonwalk's key insight is stack desynchronization: the physical stack layout seen by a stack walker does NOT have to reflect the actual execution path. By carefully constructing the stack, you can make the unwinder see call chain A → B → C → D while the actual execution path was completely different. This module explains the theory behind this separation.

Synchronized vs Desynchronized Stacks

In normal execution, the stack is synchronized — the return addresses on the stack directly correspond to the actual call chain. If main() calls foo() which calls bar(), the stack unwinder will see exactly that chain.

A desynchronized stack breaks this correspondence. The execution might follow path X → Y → Z, but the stack is constructed so the unwinder sees A → B → C. The two views are decoupled:

Synchronized vs Desynchronized

Synchronized (Normal)

Execution: main → foo → bar
Stack shows: main → foo → bar
Match: YES

Desynchronized (SilentMoonwalk)

Execution: shellcode → ROP → syscall
Stack shows: RtlUserThreadStart → BaseThreadInitThunk → kernel32!SleepEx → ...
Match: NO (by design)

Why Desynchronization Is Possible

Stack desynchronization is possible because of a fundamental property of the x64 unwinding mechanism: the unwinder is stateless. It doesn't track what functions were actually called. It only looks at:

  1. The current RIP (instruction pointer)
  2. The current RSP (stack pointer)
  3. The RUNTIME_FUNCTION / UNWIND_INFO for the function containing RIP
  4. The values at specific stack offsets (as dictated by the unwind codes)

If you arrange these four things to be internally consistent, the unwinder will happily produce a clean call chain — regardless of whether those functions were ever actually called.
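The statelessness described above can be illustrated with a toy model. The sketch below is not RtlVirtualUnwind; `ToyFunction`, `ToyStack`, `Lookup`, and `ToyWalk` are hypothetical names invented for illustration, and each function is reduced to an address range plus a fixed frame size. It only shows that the reported chain is derived from RIP, RSP, the lookup table, and stack memory, with no record of real calls.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a RUNTIME_FUNCTION + UNWIND_INFO pair:
// an address range plus the bytes the prologue takes from RSP.
struct ToyFunction {
    std::string name;
    uint64_t begin;       // first address of the function
    uint64_t end;         // one past the last address
    uint64_t frameBytes;  // pushes + allocations, excluding the return address
};

// Toy stack memory: maps an address to the qword stored there.
using ToyStack = std::map<uint64_t, uint64_t>;

// Linear lookup of the function whose range contains rip (nullptr if none).
const ToyFunction* Lookup(const std::vector<ToyFunction>& table, uint64_t rip) {
    for (const auto& f : table)
        if (rip >= f.begin && rip < f.end) return &f;
    return nullptr;
}

// Walks frames using only (rip, rsp, table, stack memory). It keeps no
// history, so it cannot tell a real call chain from a consistent layout.
std::vector<std::string> ToyWalk(const std::vector<ToyFunction>& table,
                                 const ToyStack& stack,
                                 uint64_t rip, uint64_t rsp) {
    std::vector<std::string> chain;
    while (const ToyFunction* f = Lookup(table, rip)) {
        chain.push_back(f->name);
        rsp += f->frameBytes;          // undo the prologue
        auto slot = stack.find(rsp);   // slot holding the return address
        if (slot == stack.end() || slot->second == 0) break;
        rip = slot->second;            // continue in the caller
        rsp += 8;                      // pop the return address
    }
    return chain;
}
```

In this model the walker reports whatever chain the memory contents describe, which is the property the rest of this module relies on.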

The Unwinder's Trust Model

RtlVirtualUnwind trusts that the stack has not been tampered with. It assumes that if RIP is inside function F at offset X, then every prologue instruction whose unwind code has a CodeOffset at or below X has already executed, and the stack contains F's saved registers and local variables at the offsets described by F's UNWIND_INFO. When RIP is past the prologue, that means the whole prologue. SilentMoonwalk exploits this trust by placing values at exactly the right offsets.
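How the unwinder decides which prologue operations to reverse can be sketched as follows. This is a simplified model, assuming each unwind code is reduced to its prologue offset and the number of stack bytes it accounts for; `ToyUnwindCode` and `BytesToUnwind` are hypothetical names, not Windows types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified unwind code (hypothetical type): the prologue offset at which
// the instruction has completed, and the stack bytes it accounts for.
struct ToyUnwindCode {
    uint8_t codeOffset;  // offset of the start of the next instruction
    uint32_t bytes;      // bytes this instruction took from RSP
};

// Returns how many bytes would be added back to RSP when RIP is
// ripOffset bytes into the function. Only operations whose instruction
// has already executed (codeOffset <= ripOffset) are reversed.
uint32_t BytesToUnwind(const std::vector<ToyUnwindCode>& codes,
                       uint32_t ripOffset) {
    uint32_t total = 0;
    for (const auto& c : codes)
        if (c.codeOffset <= ripOffset) total += c.bytes;
    return total;
}
```

A return address that points well past the prologue therefore makes the unwinder reverse every operation, which is why a spoofed frame has to account for the function's full frame size.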

The SilentMoonwalk Approach

SilentMoonwalk achieves desynchronization through a multi-step process. At a high level:

Step 1: Select Target Functions

Choose a set of legitimate functions from ntdll.dll or kernel32.dll whose call chain would be plausible. For example, a sleeping thread might reasonably show:

ntdll!NtWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx+0x8e
kernel32!SleepEx+0x63
kernel32!Sleep+0x9
SomeApp!WorkerThread+0x42
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Step 2: Compute Frame Sizes

For each target function, parse its RUNTIME_FUNCTION and UNWIND_INFO to determine the exact frame size. The frame size is the total amount of stack space allocated by the function's prologue (SUB RSP + pushed registers). Each frame must occupy exactly this many bytes on the spoofed stack.

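// Computing frame size from UNWIND_INFO
// This determines how much stack space a function occupies.
// Assumes no frame pointer (UWOP_SET_FPREG) and no chained unwind info.
DWORD ComputeFrameSize(PUNWIND_INFO pUnwind) {
    DWORD frameSize = 0;

    for (UBYTE i = 0; i < pUnwind->CountOfCodes; i++) {
        UNWIND_CODE code = pUnwind->UnwindCode[i];
        switch (code.UnwindOp) {
            case UWOP_PUSH_NONVOL:   // 0
                frameSize += 8;       // Each push adds 8 bytes
                break;
            case UWOP_ALLOC_LARGE:   // 1
                if (code.OpInfo == 0) {
                    // Next slot holds the allocation size divided by 8
                    frameSize += pUnwind->UnwindCode[++i].FrameOffset * 8;
                } else {
                    // Next two slots hold the unscaled 32-bit size
                    DWORD size = *(DWORD*)&pUnwind->UnwindCode[i + 1];
                    frameSize += size;
                    i += 2;
                }
                break;
            case UWOP_ALLOC_SMALL:   // 2
                frameSize += (code.OpInfo * 8) + 8;
                break;
            case UWOP_SAVE_NONVOL:   // 4
            case UWOP_SAVE_XMM128:   // 8
                i += 1;               // Skip one operand slot; RSP unchanged
                break;
            case UWOP_SAVE_NONVOL_FAR:  // 5
            case UWOP_SAVE_XMM128_FAR:  // 9
                i += 2;               // Skip two operand slots; RSP unchanged
                break;
            // UWOP_SET_FPREG (3) and UWOP_PUSH_MACHFRAME (10) occupy one
            // slot each and are not handled by this simplified version
        }
    }

    return frameSize;  // Total bytes between RSP and return address
}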

Step 3: Construct the Fake Stack

Lay out the spoofed frames contiguously on the stack. Each frame has the exact size computed from the target function's unwind codes. At the highest address of each frame, immediately above the space its unwind codes account for, place the return address pointing into the next target function in the chain.

Synthetic Stack Layout

Low address (RSP points here)
Frame 0 — "inside" NtWaitForSingleObject
Size: matches UNWIND_INFO of NtWaitForSingleObject
Return addr at computed offset → points into WaitForSingleObjectEx
Frame 1 — "inside" WaitForSingleObjectEx
Size: matches UNWIND_INFO of WaitForSingleObjectEx
Return addr → points into SleepEx
Frame 2 — "inside" SleepEx
Size: matches UNWIND_INFO of SleepEx
Return addr → points into BaseThreadInitThunk (the Sleep and WorkerThread frames from Step 1 are left out of this diagram)
Frame 3 — "inside" BaseThreadInitThunk
Size: matches UNWIND_INFO
Return addr → points into RtlUserThreadStart
Frame 4 — RtlUserThreadStart (terminal frame)
Unwinder stops here.
High address (stack bottom)

Step 4: Use ROP to Actually Execute

The tricky part: this same stack layout must also function as a ROP chain for actual execution. SilentMoonwalk places gadget addresses at specific positions within each frame so that during execution, RSP advances through the frames via ADD RSP, N; RET gadgets, while from the unwinder's perspective, those same positions look like legitimate frame contents.

The Frame Size Problem

The hardest constraint in stack desynchronization is frame size matching. Consider this scenario:

SleepEx's UNWIND_INFO says:
  UWOP_ALLOC_SMALL: 0x28 bytes (OpInfo=4, so (4*8)+8 = 40 = 0x28)
  UWOP_PUSH_NONVOL: RBX (adds 8 bytes)
  UWOP_PUSH_NONVOL: RSI (adds 8 bytes)

Total frame size: 0x28 + 8 + 8 = 0x38 bytes (56 bytes)
Plus 8 bytes for return address = 0x40 (64 bytes) total between caller RSP and callee RSP

This means in our synthetic stack, the "SleepEx frame" must occupy
exactly 0x40 bytes. Not 0x38, not 0x48. Exactly 0x40.
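The arithmetic above can be checked with a short sketch. It hard-codes the three unwind codes from this illustrative SleepEx example (real values differ between Windows builds) and assumes they are processed in the order listed, so RBX is restored before RSI. `ToyFrameLayout`, `AllocSmallBytes`, and `SleepExExampleLayout` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Offsets are measured from the callee's RSP (lowest address of the frame).
struct ToyFrameLayout {
    uint32_t rbxSlot;     // where the unwinder reads saved RBX
    uint32_t rsiSlot;     // where the unwinder reads saved RSI
    uint32_t returnSlot;  // where the unwinder reads the return address
    uint32_t span;        // distance from callee RSP to caller RSP
};

// UWOP_ALLOC_SMALL encodes its size as (OpInfo * 8) + 8.
constexpr uint32_t AllocSmallBytes(uint8_t opInfo) {
    return opInfo * 8u + 8u;
}

// Replays the example's unwind codes in listed order.
inline ToyFrameLayout SleepExExampleLayout() {
    ToyFrameLayout layout{};
    uint32_t rsp = 0;
    rsp += AllocSmallBytes(4);   // UWOP_ALLOC_SMALL, OpInfo=4 -> 0x28
    layout.rbxSlot = rsp;        // UWOP_PUSH_NONVOL RBX
    rsp += 8;
    layout.rsiSlot = rsp;        // UWOP_PUSH_NONVOL RSI
    rsp += 8;
    layout.returnSlot = rsp;     // return address follows the saved registers
    rsp += 8;
    layout.span = rsp;           // 0x28 + 8 + 8 + 8
    return layout;
}
```

Under these assumptions the saved RBX sits at offset 0x28, saved RSI at 0x30, and the return address at 0x38, for a total span of 0x40 bytes.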

The Non-Volatile Register Constraint

It's not enough to just get the frame size right. If the unwind codes include UWOP_PUSH_NONVOL for RBX and RSI, the unwinder will read values from those stack positions and restore them into the register context. If those positions contain garbage or obviously wrong values (like 0xDEADBEEF), a sophisticated analyzer might flag the frame as synthetic. SilentMoonwalk must place plausible values for saved registers.

Execution Flow vs Unwind Flow

The key to understanding SilentMoonwalk is recognizing that execution and unwinding traverse the same stack memory in the same direction, but are driven by different mechanisms and read different things from it:

Property        | Execution (ROP chain)                     | Unwinding (RtlVirtualUnwind)
Direction       | RSP increases (moves toward stack bottom) | RSP increases (reverses prologue)
Driven by       | RET instructions popping addresses        | UNWIND_CODE processing
What it reads   | Gadget addresses from the stack           | Return addresses + saved registers
When it happens | During actual code execution              | When EDR/OS walks the stack
State tracking  | CPU registers (RSP, RIP, etc.)            | CONTEXT structure (simulated)

SilentMoonwalk's genius is making both traversals work simultaneously on the same physical memory layout.

Comparison with Prior Approaches

ThreadStackSpoofer (Gen 1)

ThreadStackSpoofer by mgeeky simply overwrites the return address with NULL or a legitimate address before sleeping, then restores it after waking. This is a single-frame manipulation — it only modifies one return address. The unwinder hits the modified frame and either stops (NULL) or tries to unwind the spoofed function (usually failing because frame sizes don't match).

CallStackSpoofingPOC (Gen 2)

CallStackSpoofingPOC by pard0p uses a single ROP gadget (typically ADD RSP, 0x?? ; RET) to bridge between the real and fake parts of the stack. It constructs a fake frame, but it still spoofs only one or two frames rather than the entire chain.

SilentMoonwalk (Gen 3)

SilentMoonwalk constructs the entire call chain — every frame from the current position all the way back to RtlUserThreadStart. Each frame has correct size, plausible saved register values, and a return address pointing to a real instruction inside the target function. This is what makes it "fully dynamic" — it generates the spoofed stack on the fly based on the actual UNWIND_INFO of target functions.

Dynamic vs Static Spoofing

Earlier tools used hardcoded offsets and specific function addresses, making them brittle across Windows versions. SilentMoonwalk dynamically parses the .pdata section at runtime, computing frame sizes from the actual UNWIND_INFO structures. This means it works across Windows versions without needing to update offsets — as long as the target functions exist and have valid unwind data.
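The runtime lookup that replaces hardcoded offsets can be sketched with a portable toy model. The PE format requires the function table in .pdata to be sorted by start address, so finding the entry for a given RVA is a binary search. `ToyRuntimeFunction` mirrors the three RVA fields of RUNTIME_FUNCTION, and `FindEntry` is a hypothetical helper; on Windows the documented API for this lookup is RtlLookupFunctionEntry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the three RVA fields of an x64 RUNTIME_FUNCTION (toy type).
struct ToyRuntimeFunction {
    uint32_t beginRva;       // first byte of the function
    uint32_t endRva;         // one past the last byte
    uint32_t unwindInfoRva;  // location of its UNWIND_INFO
};

// Binary search over a table sorted by beginRva.
// Returns nullptr when rva falls outside every entry.
const ToyRuntimeFunction* FindEntry(
        const std::vector<ToyRuntimeFunction>& pdata, uint32_t rva) {
    // First entry that starts after rva; the candidate is the one before it.
    auto it = std::upper_bound(
        pdata.begin(), pdata.end(), rva,
        [](uint32_t value, const ToyRuntimeFunction& e) {
            return value < e.beginRva;
        });
    if (it == pdata.begin()) return nullptr;
    --it;
    return (rva < it->endRva) ? &*it : nullptr;
}
```

Because the entry is found from the loaded image itself, the frame size can be recomputed for whatever build of the DLL is present, which is the portability benefit described above.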

Pop Quiz: Desynchronization Theory

Q1: Why can the unwinder be fooled by a synthetic stack?

RtlVirtualUnwind is purely stateless. At each iteration, it looks up the RUNTIME_FUNCTION for the current RIP, applies the unwind codes to RSP, and reads the return address. It has no memory of what was called before and no way to verify that the function was actually invoked.

Q2: What is the "frame size problem" in stack desynchronization?

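If the target function's UNWIND_INFO says the prologue took N bytes from the stack (allocations plus pushed registers), the unwinder will add N to RSP when processing that frame and then read the return address from the slot it lands on. The synthetic frame must therefore place its return address exactly N bytes above the frame's starting RSP, so the frame occupies N + 8 bytes in total and the next frame begins immediately after it.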

Q3: How does SilentMoonwalk differ from ThreadStackSpoofer?

ThreadStackSpoofer performs a simple single-frame return address overwrite. SilentMoonwalk constructs a complete synthetic call chain where every frame has the correct size, valid return addresses, and plausible saved register values, making the entire stack walkable by RtlVirtualUnwind.