Difficulty: Advanced

Module 7: Argument Marshalling & Syscall Execution

Registers from the past, a stack from the present — the Frankenstein context that fools kernel telemetry.

The Critical Moment

When the three-condition algorithm (Module 6) reaches IsSubRsp == 2, we hold two things: a saved context from the original syscall (containing all the real arguments) and a live stack with genuine call frames from the MessageBoxW chain. The context swap merges these into a single execution state: real arguments + legitimate call stack. This is the moment everything comes together.

The Context Swap

The handler now performs the context swap, the most critical operation in the technique. The sequence is carefully ordered so that each context contributes exactly what is needed: the registers come from the saved context and the stack pointer comes from the live one.

C++
// IsSubRsp == 2: All three conditions met
if (IsSubRsp == 2) {
    // Step 1: Save the CURRENT RSP (legitimate call stack)
    ULONG_PTR TempRsp = ExceptionInfo->ContextRecord->Rsp;

    // Step 2: Restore the SAVED context (real syscall arguments)
    //         This overwrites ALL registers including RSP
    memcpy(ExceptionInfo->ContextRecord,
           &SavedContext,
           sizeof(CONTEXT));

    // Step 3: Replace RSP with the legitimate stack pointer
    //         This keeps the genuine call frames from MessageBoxW chain
    ExceptionInfo->ContextRecord->Rsp = TempRsp;

    // ... (syscall emulation and argument copying follow)
}

Step-by-Step Breakdown

| Step | What Changes | Why |
| --- | --- | --- |
| Save TempRsp | Copy current RSP to a local variable | The current RSP points to the legitimate call stack (MessageBoxW → user32 → ntdll). We must not lose this. |
| Restore SavedContext | Overwrite the entire CONTEXT with the saved snapshot | This restores RCX, RDX, R8, R9 (first 4 arguments), RAX, and all other registers to their values at the original syscall breakpoint. |
| Replace RSP | Overwrite RSP with TempRsp | The saved RSP pointed to the wrapper function's stack. We replace it with the legitimate stack so the call frames above us are from the MessageBoxW chain. |

Context Swap Visualized

SavedContext (from Phase 2)

RCX = arg1 (real)
RDX = arg2 (real)
R8 = arg3 (real)
R9 = arg4 (real)
RSP = wrapper stack (discard)

Current Context (from trace)

RCX = MessageBoxW junk (discard)
RDX = MessageBoxW junk (discard)
R8 = MessageBoxW junk (discard)
R9 = MessageBoxW junk (discard)
RSP = legitimate stack (keep!)

Entries marked (real) or (keep!) survive the swap; entries marked (discard) are dropped. The final state has real arguments + legitimate stack.
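
The three steps above can be modeled without any Windows types. The sketch below is only an illustration: MiniContext and swap_context are hypothetical names that stand in for the CONTEXT record and the handler logic, and only the five registers from the diagram are represented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the few CONTEXT fields discussed in this section.
struct MiniContext {
    std::uint64_t rcx, rdx, r8, r9, rsp;
};

// Take every register from `saved`, but keep the stack pointer from `live`.
MiniContext swap_context(const MiniContext& saved, const MiniContext& live) {
    const std::uint64_t keep_rsp = live.rsp;  // Step 1: remember the live RSP
    MiniContext merged = saved;               // Step 2: overwrite every field
    merged.rsp = keep_rsp;                    // Step 3: put the live RSP back
    assert(merged.rsp == live.rsp);           // the stack pointer survived the copy
    return merged;
}
```

The ordering matters: if RSP were not saved before the whole-structure copy, the only record of the legitimate stack pointer would be overwritten.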

Syscall Emulation

After the context swap restores the real arguments, the handler must set up the CPU state exactly as if the ntdll syscall stub had executed normally. This means emulating the two instructions that the stub performs before the syscall opcode:

x86-64 ASM
; Normal ntdll stub (what we are emulating):
mov r10, rcx        ; Save first argument (syscall clobbers RCX)
mov eax, <SSN>      ; Load System Service Number
syscall             ; Enter kernel

C++
// Emulate: mov r10, rcx
ExceptionInfo->ContextRecord->R10 =
    ExceptionInfo->ContextRecord->Rcx;

// Emulate: mov eax, SSN
ExceptionInfo->ContextRecord->Rax = SyscallNo;

// Point RIP directly at the syscall instruction
ExceptionInfo->ContextRecord->Rip =
    SyscallEntryAddr + OPCODE_SYSCALL_OFF;

Why R10 = RCX?

The x64 syscall instruction is destructive: it saves the return address in RCX (overwriting whatever was there) and saves RFLAGS in R11. This means the first argument (originally in RCX per the Windows x64 calling convention) would be lost. The ntdll stub copies RCX to R10 before the syscall so the kernel can read the first argument from R10 instead. LayeredSyscall must replicate this behavior.

| Register | Value Set | Purpose |
| --- | --- | --- |
| R10 | Copy of RCX (first argument) | Kernel reads arg1 from R10 since syscall clobbers RCX |
| RAX | System Service Number (SSN) | Kernel uses RAX to index the System Service Descriptor Table (SSDT) |
| RIP | SyscallEntryAddr + OPCODE_SYSCALL_OFF | Execution resumes directly at the syscall instruction inside ntdll |
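
To make the register clobbering concrete, here is a small model of the documented register effects. It is a sketch, not an emulator: Regs, model_stub_prologue, and model_syscall_clobber are hypothetical names, and only the registers named in this section are represented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register file holding only the registers discussed above.
struct Regs {
    std::uint64_t rax, rcx, r10, r11, rflags;
};

// The two stub instructions that run before SYSCALL.
void model_stub_prologue(Regs& r, std::uint32_t ssn) {
    r.r10 = r.rcx;  // mov r10, rcx
    r.rax = ssn;    // mov eax, <SSN> (a 32-bit write zero-extends into RAX)
    assert(r.r10 == r.rcx);
}

// Register side effects of SYSCALL that matter here:
// RCX receives the address of the next instruction, R11 receives RFLAGS.
void model_syscall_clobber(Regs& r, std::uint64_t next_rip) {
    r.rcx = next_rip;
    r.r11 = r.rflags;
}
```

After both functions run, the first argument is readable only through R10, which is why the copy has to happen before the syscall instruction.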

Extended Arguments (5th through 12th)

The x64 Windows calling convention passes the first four arguments in registers (RCX, RDX, R8, R9). Any additional arguments go on the stack. The context swap restored the registers but kept the legitimate stack (TempRsp), which does not contain the original arguments 5 and up. They must be copied over from the saved stack.

C++
if (ExtendedArgs) {
    ULONG_PTR Rsp      = ExceptionInfo->ContextRecord->Rsp;  // Legitimate stack
    ULONG_PTR SavedRsp = SavedContext.Rsp;                     // Original stack

    // Copy arguments 5 through 12 from saved stack to legitimate stack
    *(ULONG_PTR*)(Rsp + FIFTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + FIFTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + SIXTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + SIXTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + SEVENTH_ARGUMENT)   = *(ULONG_PTR*)(SavedRsp + SEVENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + EIGHTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + EIGHTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + NINTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + NINTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + TENTH_ARGUMENT)     = *(ULONG_PTR*)(SavedRsp + TENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + ELEVENTH_ARGUMENT)  = *(ULONG_PTR*)(SavedRsp + ELEVENTH_ARGUMENT);
    *(ULONG_PTR*)(Rsp + TWELVETH_ARGUMENT)  = *(ULONG_PTR*)(SavedRsp + TWELVETH_ARGUMENT);
}
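
The eight assignments above copy consecutive 8-byte slots, so the same effect can be expressed as a loop. The sketch below runs on two byte buffers that stand in for the stacks. The names copy_stack_args, kFirstStackArg, kLastStackArg, and kSlot are hypothetical, and the offsets 0x28 and 0x60 are taken from the offset table in this module.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kFirstStackArg = 0x28;  // offset of the 5th argument
constexpr std::size_t kLastStackArg  = 0x60;  // offset of the 12th argument
constexpr std::size_t kSlot          = 8;     // every argument slot is 8 bytes

// Copy the eight argument slots from `src` to the same offsets in `dst`.
void copy_stack_args(std::vector<std::uint8_t>& dst,
                     const std::vector<std::uint8_t>& src) {
    assert(dst.size() >= kLastStackArg + kSlot);
    assert(src.size() >= kLastStackArg + kSlot);
    for (std::size_t off = kFirstStackArg; off <= kLastStackArg; off += kSlot) {
        std::memcpy(dst.data() + off, src.data() + off, kSlot);
    }
}
```

Bytes below offset 0x28 (return address and shadow space) and above the 12th slot are left untouched in the destination.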

Argument Offset Table

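| Argument | Passing Method | Stack Offset | Constant |
| --- | --- | --- | --- |
| 1st (arg1) | RCX register | - | - |
| 2nd (arg2) | RDX register | - | - |
| 3rd (arg3) | R8 register | - | - |
| 4th (arg4) | R9 register | - | - |
| 5th | Stack | RSP + 0x28 | FIFTH_ARGUMENT |
| 6th | Stack | RSP + 0x30 | SIXTH_ARGUMENT |
| 7th | Stack | RSP + 0x38 | SEVENTH_ARGUMENT |
| 8th | Stack | RSP + 0x40 | EIGHTH_ARGUMENT |
| 9th | Stack | RSP + 0x48 | NINTH_ARGUMENT |
| 10th | Stack | RSP + 0x50 | TENTH_ARGUMENT |
| 11th | Stack | RSP + 0x58 | ELEVENTH_ARGUMENT |
| 12th | Stack | RSP + 0x60 | TWELVETH_ARGUMENT |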

Why 0x28 for the 5th Argument?

On x64 Windows, the caller reserves 32 bytes (0x20) of shadow space (also called "home space") for the first 4 arguments, even though they travel in registers. At the entry of the called function, the return address occupies the 8 bytes at RSP + 0x00 and the shadow space spans RSP + 0x08 through RSP + 0x27. The 5th argument therefore sits at RSP + 0x28 (return address 0x08 + shadow space 0x20 = 0x28). Each subsequent argument is 8 bytes further.
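
The arithmetic can be checked at compile time. This is a minimal sketch: stack_arg_offset and the three constants are hypothetical names introduced here, not identifiers from LayeredSyscall.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kReturnAddress = 0x08;  // pushed by the call instruction
constexpr std::size_t kShadowSpace   = 0x20;  // home space for RCX, RDX, R8, R9
constexpr std::size_t kSlotSize      = 0x08;  // each stack argument is 8 bytes

// Offset from RSP (at function entry) of the n-th argument, valid for n >= 5.
constexpr std::size_t stack_arg_offset(unsigned n) {
    return kReturnAddress + kShadowSpace + (n - 5) * kSlotSize;
}

// Runtime variant that rejects register arguments.
inline std::size_t checked_stack_arg_offset(unsigned n) {
    assert(n >= 5);
    return stack_arg_offset(n);
}

static_assert(stack_arg_offset(5) == 0x28, "5th argument");
static_assert(stack_arg_offset(12) == 0x60, "12th argument");
```

The two static_assert lines match the first and last stack rows of the offset table.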

Why Not Always Copy?

Functions with 4 or fewer arguments (like NtClose with 1 argument) do not use stack arguments. The ExtendedArgs flag avoids unnecessary memory writes for these simple functions. This is both a performance optimization and a safety measure — writing to stack locations that are not expected to hold arguments could corrupt other data.

Clearing the Trap Flag

After the context swap and syscall emulation are complete, the handler must clear the Trap Flag (TF, bit 8 of EFLAGS). If it remained set, single-stepping would continue after the handler returns, and every user-mode instruction executed from that point on would raise another SINGLE_STEP exception, which would be catastrophic.

C++
// Clear the Trap Flag - stop single-stepping
ExceptionInfo->ContextRecord->EFlags &= ~0x100;

// Reset the state machine for the next wrapped syscall
IsSubRsp = 0;

return EXCEPTION_CONTINUE_EXECUTION;
// Execution resumes at the syscall instruction with:
//   - Real arguments in registers (RCX, RDX, R8, R9)
//   - R10 = RCX (syscall convention)
//   - RAX = SSN
//   - RSP pointing to legitimate call stack with args 5+ copied
//   - RIP at the syscall instruction inside ntdll
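
The mask 0x100 used above is bit 8 of EFLAGS. The sketch below spells that out on a plain integer. The helper names are hypothetical, and the sample values in the usage note are arbitrary.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kTrapFlag = 1u << 8;  // TF is bit 8 of EFLAGS

constexpr bool is_single_stepping(std::uint32_t eflags) {
    return (eflags & kTrapFlag) != 0;
}

constexpr std::uint32_t clear_trap_flag(std::uint32_t eflags) {
    return eflags & ~kTrapFlag;  // every other flag bit is preserved
}

// In-place variant, mirroring the `EFlags &= ~0x100` statement above.
inline void stop_single_stepping(std::uint32_t& eflags) {
    eflags = clear_trap_flag(eflags);
    assert(!is_single_stepping(eflags));
}

static_assert(kTrapFlag == 0x100, "matches the literal used in the handler");
```

For example, clearing TF in the value 0x346 yields 0x246: only bit 8 changes.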

What Happens Next

The syscall instruction executes inside ntdll.dll memory, with genuine call stack frames above it, correct arguments, and the proper SSN in RAX. The kernel processes the request normally. To the kernel's ETW telemetry, this looks like a completely legitimate system call from a standard API chain.

Phase 4: Clean Return via Dr1

After the kernel completes the syscall, execution returns to the ret instruction in the ntdll stub (immediately after the syscall instruction). The Dr1 hardware breakpoint fires, generating another EXCEPTION_SINGLE_STEP.

C++
// Inside HandlerHwBp - Phase 4: Return breakpoint
if (ExceptionInfo->ContextRecord->Rip ==
        SyscallEntryAddr + OPCODE_SYSCALL_RET_OFF)
{
    // 1. Disable Dr1 breakpoint (clear bit 2 in Dr7)
    ExceptionInfo->ContextRecord->Dr7 &= ~(1 << 2);

    // 2. Restore the original RSP from SavedContext
    //    This points back to the wrapper function's stack frame
    ExceptionInfo->ContextRecord->Rsp = SavedContext.Rsp;

    // 3. Continue execution - the 'ret' instruction will now
    //    return to the wrapper function as if nothing happened
    return EXCEPTION_CONTINUE_EXECUTION;
}

Why Restore RSP Here?

During the syscall, RSP pointed to the legitimate stack (from the MessageBoxW chain). But the wrapper function expects to resume with its own stack frame. Once the handler restores SavedContext.Rsp, the ret instruction pops the correct return address and execution returns to the wrapper function. The return value (NTSTATUS) is in RAX, exactly where the wrapper expects it.
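
The "bit 2 in Dr7" used by the handler follows from the layout of the debug control register: the local-enable bit for breakpoint register n is bit 2n (L0 = bit 0, L1 = bit 2, L2 = bit 4, L3 = bit 6). The sketch below operates on a plain integer, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local-enable bit for debug register Dr<index> inside Dr7 (index 0..3).
constexpr std::uint64_t dr7_local_enable_mask(unsigned index) {
    return 1ULL << (index * 2);
}

// Return `dr7` with the local-enable bit of Dr<index> cleared.
inline std::uint64_t disable_breakpoint(std::uint64_t dr7, unsigned index) {
    assert(index < 4);  // only Dr0..Dr3 hold breakpoint addresses
    return dr7 & ~dr7_local_enable_mask(index);
}

static_assert(dr7_local_enable_mask(1) == (1ULL << 2), "Dr1 is enabled by bit 2");
```

With Dr0 and Dr1 both enabled (Dr7 low bits 0b0101), disabling index 1 leaves only bit 0 set.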

Clean Return Flow

Kernel returns (RAX = NTSTATUS)
→ ret [Dr1 fires] (SINGLE_STEP)
→ HandlerHwBp (Restore RSP)
→ ret executes (Returns to wrapper)
→ wrpNtXxx() (Returns NTSTATUS)

Memory Layout During Execution

Understanding the state of RSP at each phase is critical to understanding the full technique:

Stack State at Each Phase

Phase 2: Syscall BP

RSP → wrapper stack
Args 5+ on stack
Return to wrpNtXxx

Phase 3: After Swap

RSP → ntdll frame
user32 frames
MessageBoxW frames
Args 5+ copied here

Phase 4: After Return

RSP → wrapper stack
Return to wrpNtXxx
RAX = NTSTATUS

Wrapped Syscalls: A Representative Subset

LayeredSyscall wraps approximately 31 native API functions. Here is a representative subset grouped by category, showing their argument counts and whether they require extended argument copying:

Process & Thread

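| Function | Args | ExtendedArgs |
| --- | --- | --- |
| NtCreateUserProcess | 11 | TRUE |
| NtOpenProcess | 4 | FALSE |
| NtTerminateProcess | 2 | FALSE |
| NtCreateThreadEx | 11 | TRUE |
| NtOpenThread | 4 | FALSE |
| NtResumeThread | 2 | FALSE |
| NtSuspendThread | 2 | FALSE |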

Memory

| Function | Args | ExtendedArgs |
| --- | --- | --- |
| NtAllocateVirtualMemory | 6 | TRUE |
| NtProtectVirtualMemory | 5 | TRUE |
| NtFreeVirtualMemory | 4 | FALSE |
| NtWriteVirtualMemory | 5 | TRUE |
| NtReadVirtualMemory | 5 | TRUE |
| NtMapViewOfSection | 10 | TRUE |
| NtUnmapViewOfSection | 2 | FALSE |

Section, Query & Token

| Function | Args | ExtendedArgs |
| --- | --- | --- |
| NtCreateSection | 7 | TRUE |
| NtQueryInformationProcess | 5 | TRUE |
| NtQuerySystemInformation | 4 | FALSE |
| NtQueryVirtualMemory | 6 | TRUE |
| NtOpenProcessToken | 3 | FALSE |
| NtDuplicateToken | 6 | TRUE |
| NtAdjustPrivilegesToken | 6 | TRUE |

Handle & Object

| Function | Args | ExtendedArgs |
| --- | --- | --- |
| NtClose | 1 | FALSE |
| NtDuplicateObject | 7 | TRUE |
| NtWaitForSingleObject | 3 | FALSE |
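
In every row of the tables above, ExtendedArgs is TRUE exactly when the function takes more than four arguments. The sketch below checks that rule against a few rows copied from the tables. How the project actually sets the flag is not stated in this module, so needs_extended_args is an assumption, and WrappedCall and kSample are hypothetical names.

```cpp
#include <cassert>

// One row of the tables above.
struct WrappedCall {
    const char* name;
    unsigned    args;
    bool        extended;
};

// Assumed rule: stack arguments exist only beyond the four register arguments.
constexpr bool needs_extended_args(unsigned argc) {
    return argc > 4;
}

// A few rows copied from the tables for illustration.
const WrappedCall kSample[] = {
    {"NtClose", 1, false},
    {"NtOpenProcess", 4, false},
    {"NtProtectVirtualMemory", 5, true},
    {"NtCreateUserProcess", 11, true},
};

// True when every sample row agrees with the assumed rule.
bool table_is_consistent() {
    for (const WrappedCall& call : kSample) {
        if (needs_extended_args(call.args) != call.extended) {
            return false;
        }
    }
    return true;
}
```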

Notable Absence: NtSetContextThread

NtSetContextThread cannot be wrapped because it modifies thread context — including the debug registers that LayeredSyscall depends on. Wrapping it would create a circular dependency: the hardware breakpoints need to be active to intercept the syscall, but the syscall itself would modify those same breakpoints.

Module 7 Quiz: Argument Marshalling

Q1: Why does the syscall emulation set R10 = RCX?

The x64 syscall instruction saves RIP into RCX and RFLAGS into R11. This destroys the first argument (which was in RCX). The ntdll stub copies RCX to R10 before issuing syscall so the kernel can read the first argument from R10. LayeredSyscall emulates this behavior.

Q2: At what stack offset does the 5th argument begin, and why?

The 5th argument is at RSP + 0x28. The first 0x20 bytes (32 bytes) are the shadow/home space reserved for the 4 register arguments (even though they are in registers, this space is always allocated). Then 0x08 bytes for the return address. So 0x20 + 0x08 = 0x28.

Q3: What does the Dr1 breakpoint handler do after the syscall returns from kernel mode?

The Dr1 handler clears the Dr1 breakpoint (bit 2 in Dr7) and restores RSP to SavedContext.Rsp (the wrapper function's stack pointer). When the ret instruction executes, it pops the return address from the original stack and returns to the wrapper function with the NTSTATUS result in RAX.