Difficulty: Advanced

Module 7: Argument Marshalling & Syscall Execution

Registers from the past, a stack from the present — the Frankenstein context that fools kernel telemetry.

The Critical Moment

When the three-condition algorithm (Module 6) reaches IsSubRsp == 2, we hold two things: a saved context from the original syscall (containing all the real arguments) and a live stack with genuine call frames from the MessageBoxW chain. The context swap merges these into a single execution state: real arguments + legitimate call stack. This is the moment everything comes together.

The Context Swap

The context swap is a three-step sequence, and the order matters: the live RSP must be saved first, because restoring the saved context overwrites every register, RSP included.

C++
// IsSubRsp == 2: All three conditions met
if (IsSubRsp == 2) {
    // Step 1: Save the CURRENT RSP (legitimate call stack)
    ULONG_PTR TempRsp = ExceptionInfo->ContextRecord->Rsp;

    // Step 2: Restore the SAVED context (real syscall arguments)
    //         This overwrites ALL registers including RSP
    memcpy(ExceptionInfo->ContextRecord,
           &SavedContext,
           sizeof(CONTEXT));

    // Step 3: Replace RSP with the legitimate stack pointer
    //         This keeps the genuine call frames from MessageBoxW chain
    ExceptionInfo->ContextRecord->Rsp = TempRsp;

    // ... (syscall emulation and argument copying follow)
}

Step-by-Step Breakdown

Step	What Changes	Why
Save TempRsp	Copy current RSP to a local variable	The current RSP points to the legitimate call stack (MessageBoxW → user32 → ntdll). We must not lose this.
Restore SavedContext	Overwrite the entire CONTEXT with the saved snapshot	This restores RCX, RDX, R8, R9 (first 4 arguments), RAX, and all other registers to their values at the original syscall breakpoint
Replace RSP	Overwrite RSP with TempRsp	The saved RSP pointed to the wrapper function's stack. We replace it with the legitimate stack so the call frames above us are from the MessageBoxW chain.

Context Swap Visualized

Register	SavedContext (from Phase 2)	Current Context (from trace)
RCX	arg1 (real, keep)	MessageBoxW junk (discard)
RDX	arg2 (real, keep)	MessageBoxW junk (discard)
R8	arg3 (real, keep)	MessageBoxW junk (discard)
R9	arg4 (real, keep)	MessageBoxW junk (discard)
RSP	wrapper stack (discard)	legitimate stack (keep)

The final state has real arguments + legitimate stack.

Syscall Emulation

After the context swap restores the real arguments, the handler must set up the CPU state exactly as if the ntdll syscall stub had executed normally. This means emulating the two instructions that the stub performs before the syscall opcode:

x86-64 ASM
;; Normal ntdll stub (what we are emulating):
mov r10, rcx        ; Save first argument (syscall clobbers RCX)
mov eax, <SSN>      ; Load System Service Number
syscall              ; Enter kernel

C++
// Emulate: mov r10, rcx
ExceptionInfo->ContextRecord->R10 =
    ExceptionInfo->ContextRecord->Rcx;

// Emulate: mov eax, SSN
ExceptionInfo->ContextRecord->Rax = SyscallNo;

// Point RIP directly at the syscall instruction
ExceptionInfo->ContextRecord->Rip =
    SyscallEntryAddr + OPCODE_SYSCALL_OFF;

Why R10 = RCX?

The x64 syscall instruction is destructive: it saves the return address in RCX (overwriting whatever was there) and saves RFLAGS in R11. This means the first argument (originally in RCX per the Windows x64 calling convention) would be lost. The ntdll stub copies RCX to R10 before the syscall so the kernel can read the first argument from R10 instead. LayeredSyscall must replicate this behavior.

Register	Value Set	Purpose
`R10`	Copy of RCX (first argument)	Kernel reads arg1 from R10 since `syscall` clobbers RCX
`RAX`	System Service Number (SSN)	Kernel uses RAX to index the System Service Descriptor Table (SSDT)
`RIP`	`SyscallEntryAddr + OPCODE_SYSCALL_OFF`	Execution resumes directly at the `syscall` instruction inside ntdll
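The register side effects described above can be modelled on a plain struct, with no real syscall involved. This is a minimal sketch: `MockRegs`, `EmulateStubPrologue`, and `SimulateSyscallClobber` are hypothetical names introduced for illustration, and the model covers only the RCX and R11 overwrites documented for the `syscall` instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the handful of registers discussed here.
struct MockRegs {
    std::uint64_t rcx;     // first argument on entry
    std::uint64_t r10;     // copy of the first argument for the kernel
    std::uint64_t r11;     // receives RFLAGS when syscall executes
    std::uint64_t rflags;  // current flags
};

// Models "mov r10, rcx": preserve arg1 before RCX is overwritten.
inline void EmulateStubPrologue(MockRegs& r) {
    r.r10 = r.rcx;
}

// Models only the user-visible register effects of `syscall`:
// RCX receives the return address, R11 receives RFLAGS.
inline void SimulateSyscallClobber(MockRegs& r, std::uint64_t returnAddress) {
    r.rcx = returnAddress;
    r.r11 = r.rflags;
}
```

Running the prologue first and the clobber second leaves the original first argument readable in `r10` while `rcx` holds the return address, which is the reason the handler sets R10 before redirecting RIP.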

Extended Arguments (5th through 12th)

The x64 Windows calling convention passes the first four arguments in registers (RCX, RDX, R8, R9). Any additional arguments go on the stack. The context swap restored the registers but used the legitimate stack (TempRsp). That stack does not have the original arguments 5+. They must be copied from the saved stack.

C++
if (ExtendedArgs) {
    ULONG_PTR Rsp      = ExceptionInfo->ContextRecord->Rsp;  // Legitimate stack
    ULONG_PTR SavedRsp = SavedContext.Rsp;                     // Original stack

    // Copy arguments 5 through 12 from saved stack to legitimate stack
    *(ULONG_PTR*)(Rsp + FIFTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + FIFTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + SIXTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + SIXTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + SEVENTH_ARGUMENT)   = *(ULONG_PTR*)(SavedRsp + SEVENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + EIGHTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + EIGHTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + NINTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + NINTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + TENTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + TENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + ELEVENTH_ARGUMENT)  = *(ULONG_PTR*)(SavedRsp + ELEVENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + TWELVETH_ARGUMENT)  = *(ULONG_PTR*)(SavedRsp + TWELVETH_ARGUMENT);
}
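The eight assignments above follow a regular pattern: one 8-byte slot every 8 bytes from offset 0x28 to 0x60. The sketch below expresses that pattern as a loop over two ordinary byte buffers that stand in for the stacks. `CopyStackArgs` and the constant names are assumptions made for illustration and are not taken from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative constants matching the offsets in the argument table.
constexpr std::size_t kFirstStackArgOffset = 0x28;  // 5th argument
constexpr std::size_t kLastStackArgOffset  = 0x60;  // 12th argument
constexpr std::size_t kSlotSize            = 0x08;  // one argument slot

// Copies the 5th..12th argument slots between two mock stack buffers.
// Both buffers must be at least kLastStackArgOffset + kSlotSize bytes long.
inline void CopyStackArgs(std::uint8_t* dstStack, const std::uint8_t* srcStack) {
    for (std::size_t off = kFirstStackArgOffset;
         off <= kLastStackArgOffset;
         off += kSlotSize) {
        std::memcpy(dstStack + off, srcStack + off, kSlotSize);
    }
}
```

Bytes below offset 0x28 (the return address and shadow space) are left untouched, mirroring the original code, which never writes to those slots.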

Argument Offset Table

Argument	Passing Method	Stack Offset	Macro
1st (arg1)	RCX register	—	—
2nd (arg2)	RDX register	—	—
3rd (arg3)	R8 register	—	—
4th (arg4)	R9 register	—	—
5th	Stack	RSP + 0x28	`FIFTH_ARGUMENT`
6th	Stack	RSP + 0x30	`SIXTH_ARGUMENT`
7th	Stack	RSP + 0x38	`SEVENTH_ARGUMENT`
8th	Stack	RSP + 0x40	`EIGHTH_ARGUMENT`
9th	Stack	RSP + 0x48	`NINTH_ARGUMENT`
10th	Stack	RSP + 0x50	`TENTH_ARGUMENT`
11th	Stack	RSP + 0x58	`ELEVENTH_ARGUMENT`
12th	Stack	RSP + 0x60	`TWELVETH_ARGUMENT`

Why 0x28 for the 5th Argument?

On x64 Windows, the caller always reserves 32 bytes (0x20) of shadow space (also called "home space") for the four register arguments, even though those arguments travel in registers. At the entry of the called function, the return address occupies the 8 bytes at RSP + 0x00 and the shadow space occupies RSP + 0x08 through RSP + 0x27. The 5th argument therefore sits at RSP + 0x28 (return address 0x08 + shadow space 0x20 = 0x28), and each subsequent argument is 8 bytes further.
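The offset arithmetic can be checked at compile time. The following is a small sketch; `StackArgOffset` and the constant names are hypothetical and exist only to reproduce the values in the offset table.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative constants (names are assumptions, not taken from the project).
constexpr std::size_t kReturnAddressSize = 0x08;  // pushed by the call instruction
constexpr std::size_t kShadowSpaceSize   = 0x20;  // 4 home slots * 8 bytes
constexpr std::size_t kArgSlotSize       = 0x08;  // each stack argument uses 8 bytes

// Offset from RSP, as seen at the first instruction of the callee,
// to the n-th argument. Valid for n >= 5; arguments 1-4 are in registers.
constexpr std::size_t StackArgOffset(unsigned n) {
    return kReturnAddressSize + kShadowSpaceSize + (n - 5) * kArgSlotSize;
}

static_assert(StackArgOffset(5)  == 0x28, "5th argument");
static_assert(StackArgOffset(6)  == 0x30, "6th argument");
static_assert(StackArgOffset(12) == 0x60, "12th argument");
```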

Why Not Always Copy?

Functions with 4 or fewer arguments (like NtClose with 1 argument) do not use stack arguments. The ExtendedArgs flag avoids unnecessary memory writes for these simple functions. This is both a performance optimization and a safety measure — writing to stack locations that are not expected to hold arguments could corrupt other data.
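The rule behind the flag reduces to a comparison against the number of register arguments. A minimal sketch follows, assuming the flag is derived purely from the argument count; `NeedsExtendedArgs` is a hypothetical helper, not a function from the project.

```cpp
#include <cassert>

// The Windows x64 convention passes the first four arguments in
// RCX, RDX, R8, and R9.
constexpr unsigned kRegisterArgCount = 4;

// True when at least one argument must be read from the stack.
constexpr bool NeedsExtendedArgs(unsigned argCount) {
    return argCount > kRegisterArgCount;
}

static_assert(!NeedsExtendedArgs(1), "NtClose: registers only");
static_assert(!NeedsExtendedArgs(4), "NtOpenProcess: registers only");
static_assert(NeedsExtendedArgs(5),  "NtProtectVirtualMemory: one stack argument");
static_assert(NeedsExtendedArgs(11), "NtCreateThreadEx: seven stack arguments");
```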

Clearing the Trap Flag

After the context swap and syscall emulation are complete, the handler must clear the Trap Flag (bit 8 of EFLAGS, mask 0x100). If it remains set, every subsequent user-mode instruction would raise another SINGLE_STEP exception, and the handler would keep being invoked after its state machine has already finished, which would be catastrophic.

C++
// Clear the Trap Flag - stop single-stepping
ExceptionInfo->ContextRecord->EFlags &= ~0x100;

// Reset the state machine for the next wrapped syscall
IsSubRsp = 0;

return EXCEPTION_CONTINUE_EXECUTION;
// Execution resumes at the syscall instruction with:
//   - Real arguments in registers (RCX, RDX, R8, R9)
//   - R10 = RCX (syscall convention)
//   - RAX = SSN
//   - RSP pointing to legitimate call stack with args 5+ copied
//   - RIP at the syscall instruction inside ntdll
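The `~0x100` mask in the snippet above targets the Trap Flag, which is bit 8 of EFLAGS. The sketch below applies the same mask to a plain integer so the effect can be checked in isolation; `kTrapFlag`, `IsTrapFlagSet`, and `ClearTrapFlag` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Trap Flag (TF) is bit 8 of EFLAGS.
constexpr std::uint32_t kTrapFlag = 1u << 8;  // 0x100

constexpr bool IsTrapFlagSet(std::uint32_t eflags) {
    return (eflags & kTrapFlag) != 0;
}

// Returns the flags value with TF cleared and every other bit unchanged.
constexpr std::uint32_t ClearTrapFlag(std::uint32_t eflags) {
    return eflags & ~kTrapFlag;
}

static_assert(kTrapFlag == 0x100, "TF mask");
static_assert(ClearTrapFlag(0x346) == 0x246, "only bit 8 changes");
```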

What Happens Next

The syscall instruction executes inside ntdll.dll memory, with genuine call stack frames above it, correct arguments, and the proper SSN in RAX. The kernel processes the request normally. To stack-walking telemetry such as ETW call stack capture, this is intended to resemble a system call made through a standard API chain.

Phase 4: Clean Return via Dr1

After the kernel completes the syscall, execution returns to the ret instruction in the ntdll stub (immediately after the syscall instruction). The Dr1 hardware breakpoint fires, generating another EXCEPTION_SINGLE_STEP.

C++
// Inside HandlerHwBp - Phase 4: Return breakpoint
if (ExceptionInfo->ContextRecord->Rip ==
        SyscallEntryAddr + OPCODE_SYSCALL_RET_OFF)
{
    // 1. Disable Dr1 breakpoint (clear bit 2 in Dr7)
    ExceptionInfo->ContextRecord->Dr7 &= ~(1 << 2);

    // 2. Restore the original RSP from SavedContext
    //    This points back to the wrapper function's stack frame
    ExceptionInfo->ContextRecord->Rsp = SavedContext.Rsp;

    // 3. Continue execution - the 'ret' instruction will now
    //    return to the wrapper function as if nothing happened
    return EXCEPTION_CONTINUE_EXECUTION;
}
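The `1 << 2` in step 1 follows from the Dr7 layout: the local-enable bits L0 through L3 sit at bits 0, 2, 4, and 6, so Dr1 is controlled by bit 2. The sketch below computes that mask for any of the four debug registers on a plain integer. The helper names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Local-enable bit for debug register Dr<index> (index 0..3).
// L0 = bit 0, L1 = bit 2, L2 = bit 4, L3 = bit 6.
constexpr std::uint64_t Dr7LocalEnableMask(unsigned index) {
    return 1ull << (index * 2);
}

// Returns a Dr7 value with the local-enable bit for Dr<index> cleared.
constexpr std::uint64_t DisableLocalBreakpoint(std::uint64_t dr7, unsigned index) {
    return dr7 & ~Dr7LocalEnableMask(index);
}

static_assert(Dr7LocalEnableMask(1) == (1ull << 2), "Dr1 uses bit 2");
static_assert(DisableLocalBreakpoint(0x5, 1) == 0x1, "L0 stays set, L1 cleared");
```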

Why Restore RSP Here?

During the syscall, RSP pointed to the legitimate stack (from the MessageBoxW chain). But the wrapper function expects to resume with its own stack frame. By restoring SavedContext.Rsp, the ret instruction pops the correct return address and execution returns to the wrapper function. The return value (NTSTATUS) is in RAX, exactly where the wrapper expects it.

Clean Return Flow

Kernel returns (RAX = NTSTATUS) → ret [Dr1 fires] (SINGLE_STEP) → HandlerHwBp (restore RSP) → ret executes (returns to wrapper) → wrpNtXxx() (returns NTSTATUS)

Memory Layout During Execution

Understanding the state of RSP at each phase is critical to understanding the full technique:

Stack State at Each Phase

Phase	RSP points to	Notes
Phase 2: Syscall BP	wrapper stack	Args 5+ on stack; return to wrpNtXxx
Phase 3: After Swap	ntdll frame, followed by user32 frames and MessageBoxW frames	Args 5+ copied here
Phase 4: After Return	wrapper stack	Return to wrpNtXxx; RAX = NTSTATUS

The Complete Wrapped Syscall List

LayeredSyscall wraps approximately 31 native API functions. Here is a representative subset grouped by category, showing their argument counts and whether they require extended argument copying:

Process & Thread

Function	Args	ExtendedArgs
`NtCreateUserProcess`	11	TRUE
`NtOpenProcess`	4	FALSE
`NtTerminateProcess`	2	FALSE
`NtCreateThreadEx`	11	TRUE
`NtOpenThread`	4	FALSE
`NtResumeThread`	2	FALSE
`NtSuspendThread`	2	FALSE

Memory

Function	Args	ExtendedArgs
`NtAllocateVirtualMemory`	6	TRUE
`NtProtectVirtualMemory`	5	TRUE
`NtFreeVirtualMemory`	4	FALSE
`NtWriteVirtualMemory`	5	TRUE
`NtReadVirtualMemory`	5	TRUE
`NtMapViewOfSection`	10	TRUE
`NtUnmapViewOfSection`	2	FALSE

Section, Query & Token

Function	Args	ExtendedArgs
`NtCreateSection`	7	TRUE
`NtQueryInformationProcess`	5	TRUE
`NtQuerySystemInformation`	4	FALSE
`NtQueryVirtualMemory`	6	TRUE
`NtOpenProcessToken`	3	FALSE
`NtDuplicateToken`	6	TRUE
`NtAdjustPrivilegesToken`	6	TRUE

Handle & Object

Function	Args	ExtendedArgs
`NtClose`	1	FALSE
`NtDuplicateObject`	7	TRUE
`NtWaitForSingleObject`	3	FALSE

Notable Absence: NtSetContextThread

NtSetContextThread cannot be wrapped because it modifies thread context — including the debug registers that LayeredSyscall depends on. Wrapping it would create a circular dependency: the hardware breakpoints need to be active to intercept the syscall, but the syscall itself would modify those same breakpoints.
