Difficulty: Advanced

Module 6: Call Stack Construction via Trap Flag

Real call stacks from real API calls — no fabrication, no forgery, just CPU abuse.

The Novel Insight

Previous call stack spoofing techniques (like WithSecure's approach) fabricate fake stack frames by parsing UNWIND_CODE structures and constructing synthetic return addresses. LayeredSyscall takes a radically different approach: it calls a legitimate Windows API (MessageBoxW), lets the OS build a genuine call stack through user32.dll and into ntdll.dll, then hijacks that stack at the right moment. The call stack is real because a real API was called.

The Problem: How to Get a Legitimate Call Stack

EDR products and ETW telemetry examine the call stack at the moment a syscall enters the kernel. A legitimate call stack looks like this:

Call Stack
ntdll!NtCreateUserProcess + 0x14       <-- syscall here
ntdll!SomeInternalFunction + 0x42
KERNELBASE!CreateProcessInternalW + 0x1A3
kernel32!CreateProcessW + 0x66
myapp.exe!main + 0x55

But when an offensive tool issues a direct or indirect syscall, the stack looks anomalous:

Anomalous Stack
ntdll!NtCreateUserProcess + 0x14       <-- syscall here
myapp.exe!wrpNtCreateUserProcess + 0x30  <-- suspicious!
myapp.exe!main + 0x55

The jump from myapp.exe directly to ntdll without intermediate Windows DLL frames is a strong signal of a direct/indirect syscall. EDR vendors are increasingly flagging this pattern.
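The heuristic described above can be sketched from the defender's side. This is a minimal illustration only, not how any particular EDR product works: the module allow-list, the function names, and the representation of a stack as a list of module names are all assumptions made for the example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical allow-list of modules that normally sit directly above an
// ntdll syscall stub. A real product would use far richer data.
constexpr std::string_view kSystemCallers[] = {
    "ntdll", "kernelbase", "kernel32", "user32", "win32u"};

constexpr bool is_system_caller(std::string_view module) {
    for (std::string_view known : kSystemCallers) {
        if (module == known) return true;
    }
    return false;
}

// frames[0] is the innermost frame (where the syscall instruction lives).
// Returns true when a non-system module calls straight into the ntdll stub.
constexpr bool looks_like_direct_syscall(const std::string_view* frames,
                                         std::size_t count) {
    if (count < 2 || frames[0] != "ntdll") return false;
    return !is_system_caller(frames[1]);
}
```

Given the two stacks shown earlier, reduced to module names, the first is accepted and the second is flagged, because myapp.exe appears immediately above the ntdll frame.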

Fabricated vs. Genuine Stacks

Other tools construct fake frames by manually writing return addresses onto the stack. These frames point to real code, but execution never actually passed through them. Advanced detections can verify stack consistency using unwinding metadata. LayeredSyscall's frames are designed to pass such checks because they were created by actual execution.

Phase 2: Syscall Breakpoint Fires

After AddHwBp installs hardware breakpoints and the wrapper calls the real Nt* function, execution proceeds through the ntdll stub until it reaches the syscall instruction. The Dr0 hardware breakpoint fires, generating an EXCEPTION_SINGLE_STEP exception. This is where HandlerHwBp takes over.

C++
// Inside HandlerHwBp - Phase 2: Syscall breakpoint
if (ExceptionInfo->ContextRecord->Rip ==
        SyscallEntryAddr + OPCODE_SYSCALL_OFF)
{
    // 1. Disable Dr0 breakpoint (clear bit 0 in Dr7)
    ExceptionInfo->ContextRecord->Dr7 &= ~(1 << 0);

    // 2. Save the ENTIRE CPU context
    //    This preserves all syscall arguments in registers + stack
    memcpy(&SavedContext,
           ExceptionInfo->ContextRecord,
           sizeof(CONTEXT));

    // 3. Redirect execution to the demo function (MessageBoxW)
    ExceptionInfo->ContextRecord->Rip = (ULONG_PTR)demofunction;

    // 4. Enable the Trap Flag for single-step tracing
    ExceptionInfo->ContextRecord->EFlags |= 0x100;

    return EXCEPTION_CONTINUE_EXECUTION;
}

What Each Step Accomplishes

1. Disable Dr0

The syscall breakpoint has served its purpose. We clear it so it does not fire again when we eventually execute the real syscall instruction later. Only the Dr1 (ret) breakpoint remains active for the clean return phase.

2. Save Full Context

The memcpy of the entire CONTEXT structure captures everything: all general-purpose registers (which hold the first 4 syscall arguments in RCX, RDX, R8, R9; by this point the stub has also copied RCX into R10, which is where the kernel reads the first argument), RSP (which points to the stack containing arguments 5+), RFLAGS, and the segment registers. This snapshot is the blueprint for the eventual real syscall.

3. Redirect to demofunction()

By setting RIP to demofunction (default: MessageBoxW), execution does not enter the kernel. Instead, it begins executing a completely legitimate Windows API. This API will build genuine call stack frames as it works through user32.dll and eventually calls into ntdll.dll.

4. Enable the Trap Flag

Setting bit 8 of EFlags (0x100) activates the CPU Trap Flag. After every single instruction, the CPU raises EXCEPTION_SINGLE_STEP. This gives HandlerHwBp the ability to monitor instruction-by-instruction execution through the legitimate API chain.
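The two bit manipulations in steps 1 and 4 are plain mask operations. The sketch below isolates them as pure functions so the values can be checked by hand. The constant and function names are made up for this example; only the bit positions (Dr7 bit 0 as the local enable for Dr0, EFlags bit 8 as the Trap Flag) come from the text above.

```cpp
#include <cstdint>

// Bit 0 of Dr7 is the local-enable bit for the Dr0 breakpoint.
constexpr std::uint64_t kDr7LocalEnable0 = 1ULL << 0;
// Bit 8 of EFlags is the Trap Flag (0x100).
constexpr std::uint32_t kTrapFlag = 1u << 8;

// Clears only the Dr0 enable bit and leaves the other Dr7 bits untouched.
constexpr std::uint64_t disable_dr0(std::uint64_t dr7) {
    return dr7 & ~kDr7LocalEnable0;
}

// Sets the Trap Flag and leaves every other flag untouched.
constexpr std::uint32_t set_trap_flag(std::uint32_t eflags) {
    return eflags | kTrapFlag;
}
```

For example, a Dr7 value of 0x5 (Dr0 and Dr1 both locally enabled) becomes 0x4 after disable_dr0, which matches the statement that only the Dr1 breakpoint remains active. Using a 64-bit constant for the mask avoids relying on sign extension of a 32-bit int.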

The CPU Trap Flag (TF)

EFlags Bit 8 — The Single-Step Mechanism

The Trap Flag is a single bit in the RFLAGS register (bit 8, value 0x100). When set, the CPU raises a debug exception (reported by Windows as EXCEPTION_SINGLE_STEP, code 0x80000004) after executing one instruction. The flag is cleared as the exception is delivered, so the CONTEXT the handler receives has TF clear and the handler must set it again if continued tracing is desired.

Property | Detail
Register | RFLAGS (EFlags in CONTEXT)
Bit Position | Bit 8
Bitmask | 0x100
Exception Generated | EXCEPTION_SINGLE_STEP (0x80000004)
Auto-Cleared | Yes, CPU clears TF after each exception
Normal Use | Debuggers for single-step execution
LayeredSyscall Use | Monitor every instruction through a legitimate API chain to find the right moment to hijack

Performance Impact

While the trap flag is active, every instruction generates an exception, and the VEH handler runs after each one. A typical API call chain from MessageBoxW into ntdll may execute thousands of instructions, which means thousands of exceptions per wrapped syscall. The overhead is significant but acceptable for the evasion benefit in targeted operations.
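The relationship between the auto-clear behavior and the exception count can be shown with a toy model. This is a simulation of the bookkeeping only, with no real CPU state involved; the function name and the rearm parameter are invented for the example.

```cpp
#include <cstdint>

constexpr std::uint32_t kTF = 0x100;  // Trap Flag mask, bit 8

// Toy model: execute `instructions` instructions, starting with TF set.
// After each traced instruction one single-step exception is counted and
// TF is cleared; the simulated handler sets it again only when `rearm` is true.
constexpr unsigned count_single_steps(unsigned instructions, bool rearm) {
    std::uint32_t eflags = kTF;
    unsigned exceptions = 0;
    for (unsigned i = 0; i < instructions; ++i) {
        if ((eflags & kTF) == 0) continue;  // not tracing: runs silently
        ++exceptions;                       // exception after this instruction
        eflags &= ~kTF;                     // flag is clear when the handler runs
        if (rearm) eflags |= kTF;           // handler opts in to one more step
    }
    return exceptions;
}
```

With 5000 instructions, a handler that never re-sets the flag sees exactly one exception, while a handler that re-sets it every time sees 5000. That one-to-one ratio is the source of the overhead described above.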

The demofunction() Redirect

After redirection, execution flows through a genuine Windows API call chain. The default demo function is MessageBoxW, which traverses multiple DLL layers before eventually reaching ntdll:

Legitimate Execution Chain

MessageBoxW (user32.dll)
→ Internal Win32 (user32/win32u)
→ NtUser* calls (win32u.dll)
→ ntdll internals (ntdll.dll)

As execution passes through each function, the CPU pushes return addresses and creates stack frames. These are genuine frames because the code is actually executing. By the time execution reaches ntdll, the call stack looks exactly like a legitimate MessageBoxW call chain.

Why MessageBoxW?

MessageBoxW is the default demofunction() because it naturally calls deep into the Windows subsystem and eventually reaches ntdll. However, this is configurable. The key requirement is that the demo function must eventually call into ntdll.dll, creating a chain of legitimate frames. MessageBoxW never actually displays because execution is hijacked before the window system call completes.

The MessageBox Never Appears

Even though MessageBoxW is called, the actual display never happens. The trap flag causes an exception after every instruction. Once the three conditions (explained below) are met inside ntdll, execution is redirected to the real syscall. MessageBoxW's internal state is abandoned. This is safe because the function was only used to build stack frames, not for its actual purpose.

Phase 3: The Three-Condition Algorithm

This is the core of the LayeredSyscall technique. While the trap flag traces through the legitimate API chain, HandlerHwBp monitors every instruction, checking three conditions that must be met before the real syscall can execute.

Waiting for ntdll

Before any conditions are checked, the handler verifies that execution is inside ntdll.dll:

C++
// Check if RIP is within ntdll address range
if (ExceptionInfo->ContextRecord->Rip >= NtdllInfo.DllBaseAddress &&
    ExceptionInfo->ContextRecord->Rip <= NtdllInfo.DllEndAddress)
{
    // Inside ntdll - start checking conditions
    // ...
}
else {
    // Not in ntdll yet - re-enable trap flag and continue
    ExceptionInfo->ContextRecord->EFlags |= 0x100;
    return EXCEPTION_CONTINUE_EXECUTION;
}

Until execution enters ntdll, the handler simply re-enables the trap flag (it is auto-cleared by the CPU) and continues. This loops thousands of times as execution traverses user32.dll and other intermediate DLLs.
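The range test itself is easy to get subtly wrong at the upper boundary. The snippet above uses an inclusive comparison against DllEndAddress, and the text does not show whether that value is the last valid byte or one past the end. The sketch below assumes the module is described by a base and a size, and uses a half-open interval; the function name and parameters are hypothetical.

```cpp
#include <cstdint>

// Half-open membership test: base <= rip < base + size.
// Written as a subtraction so that base + size cannot overflow.
constexpr bool in_module(std::uint64_t rip, std::uint64_t base,
                         std::uint64_t size) {
    return rip >= base && rip - base < size;
}
```

With a base of 0x7FF800000000 and a size of 0x200000, an address 0x1000 bytes into the image is inside, while base + size itself is treated as the first byte of whatever follows the module.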

Condition 1: Find sub rsp, X Where X >= 0x58

IsSubRsp = 0 → 1

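Once inside ntdll, the handler scans forward from the current RIP (up to 80 bytes) looking for the byte sequence 48 83 EC, which encodes sub rsp, imm8. Read as a little-endian 32-bit value and masked to its low three bytes, that sequence is 0x00EC8348, which is the constant the code compares against. The immediate byte that follows must be >= 0x58 (88 decimal).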

C++
// Condition 1: Find a function prologue with sufficient stack space
for (int i = 0; i < 80; i++) {
    ULONG_PTR scan = ExceptionInfo->ContextRecord->Rip + i;

    // Check for 'ret' - if we hit ret first, this function is too small
    if (*(BYTE*)scan == 0xC3) {
        IsSubRsp = 0;  // Reset - keep looking
        break;
    }

    // Check for 'sub rsp, imm8' where imm8 >= 0x58
    if ((*(UINT32*)scan & 0x00FFFFFF) == 0xEC8348) {
        BYTE stackSize = *(BYTE*)(scan + 3);
        if (stackSize >= 0x58) {
            IsSubRsp = 1;  // Condition 1 met!
            break;
        }
    }
}

Why 0x58? This is 88 bytes: the 4 home space slots (32 bytes) plus 7 additional stack slots (56 bytes), enough for 11 arguments in total. That covers NtCreateUserProcess, which takes 11 arguments; a 12th argument would need one more 8-byte slot (0x60).
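The pattern match in Condition 1 can be studied in isolation on a plain byte buffer. The sketch below is a bounds-checked restatement of the scan loop above, not the tool's actual code; the function name, the explicit length parameter, and the return convention (offset or -1) are assumptions made for the example. Like the original loop, it compares raw bytes rather than decoding instructions, so a 0xC3 byte inside a longer instruction would also stop the scan.

```cpp
#include <cstddef>
#include <cstdint>

// Scans `code` for the bytes 48 83 EC imm8 (sub rsp, imm8) with imm8 >= min_imm.
// Returns the offset of the match, or -1 if a 0xC3 byte (ret) is seen first
// or the buffer ends without a match.
constexpr int find_sub_rsp_imm8(const std::uint8_t* code, std::size_t len,
                                std::uint8_t min_imm) {
    for (std::size_t i = 0; i < len; ++i) {
        if (code[i] == 0xC3) return -1;  // ret byte reached before a match
        if (i + 3 < len && code[i] == 0x48 && code[i + 1] == 0x83 &&
            code[i + 2] == 0xEC && code[i + 3] >= min_imm) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For the bytes 48 89 5C 24 08 48 83 EC 60 the function returns 5, the offset of sub rsp, 0x60. For 48 83 EC 28 C3 it returns -1, because the only sub rsp reserves 0x28 bytes and a ret byte follows.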

Condition 2: Find a call Instruction

IsSubRsp = 1 → 2

With IsSubRsp == 1, the handler now monitors each instruction looking for a call opcode (0xE8, relative near call). The check matches only this direct form; an indirect call (opcode 0xFF) does not advance the state. If a ret (0xC3) is encountered before a call, the function was too small and the state machine resets to 0.

C++
// Condition 2: Find a 'call' instruction within the function
if (IsSubRsp == 1) {
    BYTE opcode = *(BYTE*)(ExceptionInfo->ContextRecord->Rip);

    if (opcode == 0xC3) {
        // Hit 'ret' before 'call' - function too shallow, reset
        IsSubRsp = 0;
    }
    else if (opcode == 0xE8) {
        // Found a 'call' instruction!
        IsSubRsp = 2;  // Condition 2 met!
    }
}

Condition 3: Execute (IsSubRsp == 2)

All Conditions Met — Ready to Swap

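When IsSubRsp reaches 2, we are positioned inside a legitimate ntdll function that has reserved at least 0x58 bytes of stack (Condition 1) and has reached a call instruction before any ret (Condition 2).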

The call stack above us is genuine. This is the moment to swap in the real syscall arguments and execute.

State Machine Diagram

Three-Condition State Machine

IsSubRsp = 0: Searching for sub rsp >= 0x58
    found →
IsSubRsp = 1: Looking for call (0xE8)
    found →
IsSubRsp = 2: Execute! Context swap

If ret (0xC3) is found before the target opcode, state resets to 0
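The transitions can be written down as a small pure function and exercised on synthetic input. This is a model of the state machine only: the type names, the TraceStep record, and the choice to process one transition per traced instruction are assumptions for the example, and the big_sub_rsp_ahead field stands in for the result of the Condition 1 scan.

```cpp
#include <cstddef>
#include <cstdint>

enum class TraceState { Searching = 0, FrameFound = 1, Ready = 2 };

// One traced instruction: its first opcode byte, and whether the forward
// scan from this point found a sub rsp, imm8 with imm8 >= 0x58.
struct TraceStep {
    std::uint8_t opcode;
    bool big_sub_rsp_ahead;
};

constexpr TraceState step_state(TraceState s, TraceStep in) {
    switch (s) {
    case TraceState::Searching:
        return in.big_sub_rsp_ahead ? TraceState::FrameFound
                                    : TraceState::Searching;
    case TraceState::FrameFound:
        if (in.opcode == 0xC3) return TraceState::Searching;  // ret: reset
        if (in.opcode == 0xE8) return TraceState::Ready;      // call: done
        return TraceState::FrameFound;
    case TraceState::Ready:
        return TraceState::Ready;
    }
    return s;
}

// Feeds the steps through the machine and stops early once Ready is reached.
constexpr TraceState run_trace(const TraceStep* steps, std::size_t count) {
    TraceState s = TraceState::Searching;
    for (std::size_t i = 0; i < count && s != TraceState::Ready; ++i) {
        s = step_state(s, steps[i]);
    }
    return s;
}
```

A trace of prologue, ordinary instruction, call ends in Ready. A trace of prologue, ret, nop falls back to Searching, which is the reset path noted in the diagram.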

Full Trace Flow with Conditions

MessageBoxW (user32.dll)
→ Trap Flag Trace (every instruction)
→ Enter ntdll? (range check), yes
→ sub rsp >= 0x58? (Condition 1), yes
→ call found? (Condition 2), yes
→ Context Swap (Condition 3)

Why These Three Conditions?

sub rsp >= 0x58 — Stack Space Guarantee

The value 0x58 (88 bytes) ensures the current stack frame has room for:

Component | Bytes | Purpose
Shadow space / Home area | 32 (0x20) | Required by x64 calling convention for RCX, RDX, R8, R9
Arguments 5-11 on stack | 56 (0x38) | Up to 7 additional stack arguments (7 slots at 8 bytes each)
Alignment | Variable | Stack must be 16-byte aligned before call

With 88 bytes or more, we can safely copy up to 11 arguments (4 in registers plus 7 on the stack, as NtCreateUserProcess requires) onto this stack frame without corrupting adjacent frames. Frames larger than 0x58 leave room for more.
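The arithmetic behind the threshold can be checked at compile time. The sketch below assumes 8-byte slots and a 32-byte home area, as in the x64 calling convention described above, and deliberately ignores alignment padding; the names are made up for the example.

```cpp
#include <cstddef>

constexpr std::size_t kHomeSpaceBytes = 4 * 8;  // home slots for RCX, RDX, R8, R9
constexpr std::size_t kSlotBytes = 8;           // one stack argument slot

// Bytes of argument area needed for a call with `argc` arguments.
constexpr std::size_t frame_bytes_for_args(std::size_t argc) {
    return kHomeSpaceBytes + (argc > 4 ? (argc - 4) * kSlotBytes : 0);
}

// Largest argument count whose argument area fits in `frame_bytes`.
// Frames smaller than the home area are reported as 0 in this sketch.
constexpr std::size_t max_args_for_frame(std::size_t frame_bytes) {
    return frame_bytes < kHomeSpaceBytes
               ? 0
               : 4 + (frame_bytes - kHomeSpaceBytes) / kSlotBytes;
}
```

An 11-argument call needs 32 + 7 * 8 = 88 bytes, which is exactly 0x58, and a 12-argument call needs 0x60.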

call Instruction — Frame Depth

Finding a call instruction means the function we are in is making sub-calls. This guarantees we are deep enough in the function that there is a proper stack frame established. If the function only did simple register operations and returned immediately (hitting ret before call), the stack frame would be too shallow and the call stack would not look convincing.

Inside ntdll — Legitimate Origin

For the call stack to be convincing, the syscall must appear to originate from within ntdll.dll. If we hijacked execution in user32.dll, the call stack would show a syscall from user32 (unusual). By waiting until execution is inside ntdll, the final stack frame chain is: ntdll ← user32 ← MessageBoxW — exactly what a legitimate API call looks like.

Module 6 Quiz: Call Stack Construction

Q1: What does the CPU Trap Flag (EFlags bit 8) do when set?

The Trap Flag (TF) causes the CPU to raise a debug exception (EXCEPTION_SINGLE_STEP) after executing each instruction. TF is clear again by the time the handler runs, so LayeredSyscall re-sets it in the handler to continue tracing.

Q2: What are the three conditions that must be met before the context swap occurs?

The three conditions are: (1) execution must be inside ntdll.dll (range check), (2) a sub rsp, imm8 with the immediate >= 0x58 must be found (sufficient stack space), and (3) a call instruction must be found (deep enough frame). Together they ensure a legitimate, spacious stack frame inside ntdll.

Q3: Why must the sub rsp immediate be at least 0x58 (88 bytes)?

0x58 (88 bytes) provides 32 bytes of shadow space (required by the x64 ABI for the first 4 register arguments) plus 56 bytes for up to 7 additional stack arguments. This covers 11-argument functions such as NtCreateUserProcess.