Module 6: Call Stack Construction via Trap Flag
Real call stacks from real API calls — no fabrication, no forgery, just CPU abuse.
The Novel Insight
Previous call stack spoofing techniques (like WithSecure's approach) fabricate fake stack frames by parsing UNWIND_CODE structures and constructing synthetic return addresses. LayeredSyscall takes a radically different approach: it calls a legitimate Windows API (MessageBoxW), lets the OS build a genuine call stack through user32.dll and into ntdll.dll, then hijacks that stack at the right moment. The call stack is real because a real API was called.
The Problem: How to Get a Legitimate Call Stack
EDR products and ETW telemetry examine the call stack at the moment a syscall enters the kernel. A legitimate call stack looks like this:
Call Stack
ntdll!NtCreateUserProcess + 0x14 <-- syscall here
ntdll!SomeInternalFunction + 0x42
kernelbase!CreateProcessInternalW + 0x1A3
kernel32!CreateProcessW + 0x66
myapp.exe!main + 0x55
But when an offensive tool issues a direct or indirect syscall, the stack looks anomalous:
Anomalous Stack
ntdll!NtCreateUserProcess + 0x14 <-- syscall here
myapp.exe!wrpNtCreateUserProcess + 0x30 <-- suspicious!
myapp.exe!main + 0x55
The jump from myapp.exe directly to ntdll without intermediate Windows DLL frames is a strong signal of a direct/indirect syscall. EDR vendors are increasingly flagging this pattern.
Fabricated vs. Genuine Stacks
Other tools construct fake frames by manually pushing return addresses onto the stack. These frames point to real code, but the execution never actually passed through them. Advanced detections can verify stack consistency using unwinding metadata. LayeredSyscall's approach is immune to such checks because the frames were created by actual execution.
Phase 2: Syscall Breakpoint Fires
After AddHwBp installs hardware breakpoints and the wrapper calls the real Nt* function, execution proceeds through the ntdll stub until it reaches the syscall instruction. The Dr0 hardware breakpoint fires, generating an EXCEPTION_SINGLE_STEP exception. This is where HandlerHwBp takes over.
    // Inside HandlerHwBp - Phase 2: Syscall breakpoint
    if (ExceptionInfo->ContextRecord->Rip ==
        SyscallEntryAddr + OPCODE_SYSCALL_OFF)
    {
        // 1. Disable Dr0 breakpoint (clear bit 0 in Dr7)
        ExceptionInfo->ContextRecord->Dr7 &= ~(1 << 0);

        // 2. Save the ENTIRE CPU context
        //    This preserves all syscall arguments in registers + stack
        memcpy(&SavedContext,
               ExceptionInfo->ContextRecord,
               sizeof(CONTEXT));

        // 3. Redirect execution to the demo function (MessageBoxW)
        ExceptionInfo->ContextRecord->Rip = (ULONG_PTR)demofunction;

        // 4. Enable the Trap Flag for single-step tracing
        ExceptionInfo->ContextRecord->EFlags |= 0x100;

        return EXCEPTION_CONTINUE_EXECUTION;
    }
What Each Step Accomplishes
1. Disable Dr0
The syscall breakpoint has served its purpose. We clear it so it does not fire again when we eventually execute the real syscall instruction later. Only the Dr1 (ret) breakpoint remains active for the clean return phase.
2. Save Full Context
The memcpy of the entire CONTEXT structure captures everything: all general-purpose registers (which hold the first 4 syscall arguments in R10, RDX, R8, R9, because the ntdll stub copies RCX into R10 before the syscall instruction), RSP (which points to the stack containing arguments 5+), RFLAGS, and the segment registers. This snapshot is the blueprint for the eventual real syscall.
3. Redirect to demofunction()
By setting RIP to demofunction (default: MessageBoxW), execution does not enter the kernel. Instead, it begins executing a completely legitimate Windows API. This API will build genuine call stack frames as it works through user32.dll and eventually calls into ntdll.dll.
4. Enable the Trap Flag
Setting bit 8 of EFlags (0x100) activates the CPU Trap Flag. After every single instruction, the CPU raises EXCEPTION_SINGLE_STEP. This gives HandlerHwBp the ability to monitor instruction-by-instruction execution through the legitimate API chain.
The CPU Trap Flag (TF)
EFlags Bit 8 — The Single-Step Mechanism
The Trap Flag is a single bit in the RFLAGS register (bit 8, value 0x100). When set, the CPU raises a debug exception (EXCEPTION_SINGLE_STEP, code 0x80000004) after executing every single instruction. The CPU automatically clears the TF after raising the exception, so the handler must re-set it if continued tracing is desired.
| Property | Detail |
|---|---|
| Register | RFLAGS (EFlags in CONTEXT) |
| Bit Position | Bit 8 |
| Bitmask | 0x100 |
| Exception Generated | EXCEPTION_SINGLE_STEP (0x80000004) |
| Auto-Cleared | Yes — CPU clears TF after each exception |
| Normal Use | Debuggers for single-step execution |
| LayeredSyscall Use | Monitor every instruction through a legitimate API chain to find the right moment to hijack |
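The bit arithmetic in the table can be checked in isolation. The sketch below is portable C++ that uses a plain 64-bit integer as a stand-in for the EFlags field. The names kTrapFlag, setTrapFlag, clearTrapFlag and isTrapFlagSet are hypothetical helpers for illustration; they are not part of LayeredSyscall or the Windows SDK.

```cpp
#include <cstdint>

// Bit 8 of RFLAGS/EFlags is the Trap Flag (TF).
constexpr std::uint64_t kTrapFlag = 1ULL << 8;  // 0x100

// Hypothetical helpers operating on a copy of the flags value.
constexpr std::uint64_t setTrapFlag(std::uint64_t flags)   { return flags | kTrapFlag; }
constexpr std::uint64_t clearTrapFlag(std::uint64_t flags) { return flags & ~kTrapFlag; }
constexpr bool isTrapFlagSet(std::uint64_t flags)          { return (flags & kTrapFlag) != 0; }

// Compile-time checks of the values quoted in the table.
static_assert(kTrapFlag == 0x100, "bit 8 has the value 0x100");
static_assert(isTrapFlagSet(setTrapFlag(0x202)), "setting TF turns bit 8 on");
static_assert(clearTrapFlag(0x302) == 0x202, "clearing TF leaves other bits intact");
```

Because TF is cleared again after each single-step exception, a tracer has to apply the equivalent of setTrapFlag on every pass for as long as it wants to keep stepping.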
Performance Impact
While the trap flag is active, every instruction generates an exception, and the VEH handler runs after each one. A typical API call chain from MessageBoxW into ntdll may execute thousands of instructions, which means thousands of exceptions per wrapped syscall. The overhead is significant but acceptable for the evasion benefit in targeted operations.
The demofunction() Redirect
After redirection, execution flows through a genuine Windows API call chain. The default demo function is MessageBoxW, which traverses multiple DLL layers before eventually reaching ntdll:
Legitimate Execution Chain: user32.dll → user32/win32u → win32u.dll → ntdll.dll
As execution passes through each function, the CPU pushes return addresses and creates stack frames. These are genuine frames because the code is actually executing. By the time execution reaches ntdll, the call stack looks exactly like a legitimate MessageBoxW call chain.
Why MessageBoxW?
MessageBoxW is the default demofunction() because it naturally calls deep into the Windows subsystem and eventually reaches ntdll. However, this is configurable. The key requirement is that the demo function must eventually call into ntdll.dll, creating a chain of legitimate frames. MessageBoxW never actually displays because execution is hijacked before the window system call completes.
The MessageBox Never Appears
Even though MessageBoxW is called, the actual display never happens. The trap flag causes an exception after every instruction. Once the three conditions (explained below) are met inside ntdll, execution is redirected to the real syscall. MessageBoxW's internal state is abandoned. This is safe because the function was only used to build stack frames, not for its actual purpose.
Phase 3: The Three-Condition Algorithm
This is the core of the LayeredSyscall technique. While the trap flag traces through the legitimate API chain, HandlerHwBp monitors every instruction, checking three conditions that must be met before the real syscall can execute.
Waiting for ntdll
Before any conditions are checked, the handler verifies that execution is inside ntdll.dll:
    // Check if RIP is within ntdll address range
    if (ExceptionInfo->ContextRecord->Rip >= NtdllInfo.DllBaseAddress &&
        ExceptionInfo->ContextRecord->Rip <= NtdllInfo.DllEndAddress)
    {
        // Inside ntdll - start checking conditions
        // ...
    }
    else {
        // Not in ntdll yet - re-enable trap flag and continue
        ExceptionInfo->ContextRecord->EFlags |= 0x100;
        return EXCEPTION_CONTINUE_EXECUTION;
    }
Until execution enters ntdll, the handler simply re-enables the trap flag (it is auto-cleared by the CPU) and continues. This loops thousands of times as execution traverses user32.dll and other intermediate DLLs.
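The range test itself is ordinary interval arithmetic. As a minimal, portable sketch, assume a module is described by a base address and a size; ModuleRange and contains are hypothetical names, and the half-open interval used here is an assumption that differs slightly from the inclusive end address in the handler above.

```cpp
#include <cstdint>

// Hypothetical description of a loaded module: [base, base + size).
struct ModuleRange {
    std::uint64_t base;
    std::uint64_t size;
};

// True when addr lies inside the module. The subtraction form avoids
// overflow if base + size would wrap around.
constexpr bool contains(const ModuleRange& m, std::uint64_t addr) {
    return addr >= m.base && addr - m.base < m.size;
}

// The first byte is inside; the byte just past the end is not.
static_assert(contains(ModuleRange{0x1000, 0x200}, 0x1000), "base is inside");
static_assert(!contains(ModuleRange{0x1000, 0x200}, 0x1200), "end is exclusive");
```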
Condition 1: Find sub rsp, X Where X >= 0x58
IsSubRsp = 0 → 1
Once inside ntdll, the handler scans forward from the current RIP (up to 80 bytes) looking for the byte sequence 48 83 EC, which reads as 0xEC8348 when loaded as a little-endian value and encodes sub rsp, imm8. The immediate byte that follows must be >= 0x58 (88 decimal).
    // Condition 1: Find a function prologue with sufficient stack space
    for (int i = 0; i < 80; i++) {
        ULONG_PTR scan = ExceptionInfo->ContextRecord->Rip + i;

        // Check for 'ret' - if we hit ret first, this function is too small
        if (*(BYTE*)scan == 0xC3) {
            IsSubRsp = 0; // Reset - keep looking
            break;
        }

        // Check for 'sub rsp, imm8' where imm8 >= 0x58
        if ((*(UINT32*)scan & 0x00FFFFFF) == 0xEC8348) {
            BYTE stackSize = *(BYTE*)(scan + 3);
            if (stackSize >= 0x58) {
                IsSubRsp = 1; // Condition 1 met!
                break;
            }
        }
    }
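Why 0x58? This is 88 bytes: the 4 home space slots (32 bytes) plus 7 additional 8-byte stack slots (56 bytes), which is 11 argument slots in total. Eight stack slots would need 64 bytes, or 96 bytes (0x60) overall, so the 12-argument maximum that LayeredSyscall supports is not something the 0x58 threshold guarantees by itself.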
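The pattern match used for Condition 1 can be exercised on an ordinary byte buffer, without touching live code. The sketch below is a portable restatement of that scan under stated assumptions: findSubRspImm8 is a hypothetical name, the buffer stands in for the bytes at RIP, and the function returns an offset instead of updating IsSubRsp.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the offset of the first `sub rsp, imm8` (bytes 48 83 EC ib) whose
// immediate is >= minImm. Returns -1 if a 0xC3 byte is seen first or if
// nothing matches within the scan window.
inline long findSubRspImm8(const std::uint8_t* code, std::size_t len,
                           std::uint8_t minImm, std::size_t window = 80)
{
    for (std::size_t i = 0; i < len && i < window; ++i) {
        if (code[i] == 0xC3) {
            return -1;  // a ret byte came first: give up on this window
        }
        // Need three opcode bytes plus the immediate inside the buffer.
        if (i + 3 < len &&
            code[i] == 0x48 && code[i + 1] == 0x83 && code[i + 2] == 0xEC &&
            code[i + 3] >= minImm) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Like the loop it mirrors, this is a raw byte scan and not a disassembler, so a 0xC3 or 48 83 EC sequence that happens to sit inside another instruction's operand is treated the same as a real one. The imm8 operand is also sign-extended by the CPU, so immediates of 0x80 and above do not describe a larger frame; a stricter check could reject them.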
Condition 2: Find a call Instruction
IsSubRsp = 1 → 2
With IsSubRsp == 1, the handler now monitors each instruction looking for a call opcode (0xE8, relative near call). If a ret (0xC3) is encountered before a call, the function was too small and the state machine resets to 0.
    // Condition 2: Find a 'call' instruction within the function
    if (IsSubRsp == 1) {
        BYTE opcode = *(BYTE*)(ExceptionInfo->ContextRecord->Rip);

        if (opcode == 0xC3) {
            // Hit 'ret' before 'call' - function too shallow, reset
            IsSubRsp = 0;
        }
        else if (opcode == 0xE8) {
            // Found a 'call' instruction!
            IsSubRsp = 2; // Condition 2 met!
        }
    }
Condition 3: Execute (IsSubRsp == 2)
All Conditions Met — Ready to Swap
When IsSubRsp reaches 2, we are positioned inside a legitimate ntdll function that has:
- A stack frame with at least 0x58 bytes (room for the syscall's stack arguments)
- A call instruction (deep enough in the function for a complete frame)
- Execution that arrived here through genuine API calls (MessageBoxW → user32 → ntdll)
The call stack above us is genuine. This is the moment to swap in the real syscall arguments and execute.
State Machine Diagram
Three-Condition State Machine
State 0: searching for sub rsp >= 0x58
State 1: looking for call (0xE8)
State 2: execute (context swap)
If ret (0xC3) is found before the target opcode, the state resets to 0.
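The transitions can be written down as a small pure function, which makes the reset behaviour easy to test away from any exception handler. This is a sketch under assumptions: TraceState and step are hypothetical names, and prologueFound stands for the outcome of the Condition 1 scan at the current instruction.

```cpp
#include <cstdint>

// The three values mirror IsSubRsp = 0, 1, 2.
enum class TraceState { SearchSubRsp = 0, SearchCall = 1, Ready = 2 };

// Computes the next state from the current state, the first opcode byte at
// the traced instruction, and whether the prologue scan matched there.
inline TraceState step(TraceState s, std::uint8_t opcodeAtRip, bool prologueFound)
{
    switch (s) {
    case TraceState::SearchSubRsp:
        return prologueFound ? TraceState::SearchCall : TraceState::SearchSubRsp;
    case TraceState::SearchCall:
        if (opcodeAtRip == 0xC3) return TraceState::SearchSubRsp;  // ret: reset
        if (opcodeAtRip == 0xE8) return TraceState::Ready;         // near call
        return TraceState::SearchCall;
    case TraceState::Ready:
        return TraceState::Ready;
    }
    return s;
}
```

Feeding it one opcode per traced instruction reproduces the diagram: a matching prologue moves 0 to 1, a 0xE8 moves 1 to 2, and a 0xC3 seen while in state 1 drops back to 0.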
Full Trace Flow with Conditions
Trace flow: user32.dll → every instruction → range check → Condition 1 → Condition 2 → Condition 3
Why These Three Conditions?
sub rsp >= 0x58 — Stack Space Guarantee
The value 0x58 (88 bytes) ensures the current stack frame has room for:
| Component | Bytes | Purpose |
|---|---|---|
| Shadow space / Home area | 32 (0x20) | Required by x64 calling convention for RCX, RDX, R8, R9 |
| Stack arguments (5 onward) | 56 (0x38) | 7 slots at 8 bytes each, i.e. arguments 5-11 |
| Alignment | Variable | Stack must be 16-byte aligned before call |
With 88 bytes or more, the home area and seven stack-passed arguments (11 arguments in total) fit inside this frame without corrupting adjacent frames. A twelfth argument needs one more 8-byte slot, 96 bytes (0x60) in all, so the 12-argument maximum that LayeredSyscall supports is not covered by the 0x58 minimum alone.
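The byte counts in this table follow from the x64 calling convention and can be recomputed for any argument count. The sketch below assumes 8-byte slots and a 4-slot home area; argAreaBytes is a hypothetical helper, and it deliberately leaves out the return address and any alignment padding.

```cpp
#include <cstddef>

constexpr std::size_t kSlotBytes = 8;  // every argument slot is 8 bytes on x64
constexpr std::size_t kHomeSlots = 4;  // home area for the 4 register arguments

// Bytes needed for the home area plus the stack-passed arguments of a call
// with argCount arguments. Return address and padding are not included.
constexpr std::size_t argAreaBytes(std::size_t argCount) {
    return (kHomeSlots + (argCount > kHomeSlots ? argCount - kHomeSlots : 0))
           * kSlotBytes;
}

static_assert(argAreaBytes(4) == 0x20, "home area only");
static_assert(argAreaBytes(11) == 0x58, "11 arguments fill exactly 88 bytes");
static_assert(argAreaBytes(12) == 0x60, "12 arguments need 96 bytes");
```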
call Instruction — Frame Depth
Finding a call instruction means the function we are in is making sub-calls. This guarantees we are deep enough in the function that there is a proper stack frame established. If the function only did simple register operations and returned immediately (hitting ret before call), the stack frame would be too shallow and the call stack would not look convincing.
Inside ntdll — Legitimate Origin
For the call stack to be convincing, the syscall must appear to originate from within ntdll.dll. If we hijacked execution in user32.dll, the call stack would show a syscall issued from user32, which is unusual. By waiting until execution is inside ntdll, the final stack frame chain is: ntdll ← user32 (MessageBoxW) ← caller, which is what a legitimate API call looks like.
Module 6 Quiz: Call Stack Construction
Q1: What does the CPU Trap Flag (EFlags bit 8) do when set?
Q2: What are the three conditions that must be met before the context swap occurs?
A2: (1) Execution must be inside ntdll.dll, (2) a sub rsp, imm8 with the immediate >= 0x58 must be found (sufficient stack space), and (3) a call instruction must be found (deep enough frame). Together they ensure a legitimate, spacious stack frame inside ntdll.
Q3: Why must the sub rsp immediate be at least 0x58 (88 bytes)?