Module 6: Decoder Stub Generation
How asmjit emits polymorphic x86-64 decoder stubs with register randomization, operation-specific code generation, and position-independent data access.
Module Objective
Understand how Shoggoth uses asmjit to generate the RC4 decoder stub and the block cipher decoder stub, how register randomization is implemented at the code generation level, how each operation in the block cipher chain maps to x86-64 instructions, and how the stubs locate their encrypted data using RIP-relative addressing.
1. Stub Generation Overview
Shoggoth generates two decoder stubs, one per encryption stage. Each stub is built using a fresh CodeHolder and x86::Assembler instance, with randomly selected registers and junk code interleaved. The generation process follows a consistent pattern:
Stub Generation Pipeline: Register assignment -> Save registers, locate data -> Loop Body -> Restore, jump to payload
The critical innovation is that each step involves random choices: which registers hold which values, whether the loop counts up or down, how the loop pointer advances, and where junk instructions are inserted. The result is that two invocations produce functionally equivalent but structurally different machine code.
2. Register Randomization Implementation
Before emitting any instructions, Shoggoth randomly assigns CPU registers to the roles needed by the decoder. The pool excludes RSP (the stack pointer, which the stub cannot clobber) and may also exclude RBP, depending on calling-convention needs; the listing below leaves both out:
// Register assignment for block cipher decoder.
// Assumes <algorithm>, <random>, <vector>, and <asmjit/x86.h> are included,
// and that `rng` is a seeded engine such as std::mt19937.
// Randomly shuffle and assign roles from the GPR pool
std::vector<x86::Gp> gprPool = {
    x86::rax, x86::rbx, x86::rcx, x86::rdx,
    x86::rsi, x86::rdi, x86::r8, x86::r9,
    x86::r10, x86::r11, x86::r12, x86::r13,
    x86::r14, x86::r15
};
std::shuffle(gprPool.begin(), gprPool.end(), rng);
// Assign roles from shuffled pool
x86::Gp regDataPtr = gprPool[0];    // Pointer to encrypted data
x86::Gp regBlockCount = gprPool[1]; // Number of 8-byte blocks
x86::Gp regCurrent = gprPool[2];    // Current block value
x86::Gp regKey = gprPool[3];        // Operation key / temp
x86::Gp regIndex = gprPool[4];      // Block index / loop counter
Because x86-64 encodes register numbers in the REX prefix and the ModR/M byte, different register assignments produce different instruction encodings. For example, xor rax, rbx encodes as 48 31 D8 while xor r10, r14 encodes as 4D 31 F2: the opcode byte 31 is unchanged, but the REX prefix and the ModR/M byte both differ for the same logical operation.
| Instruction | Example Encoding (RAX, RBX) | Example Encoding (R10, R14) | Bytes Changed |
|---|---|---|---|
| xor regA, regB | 48 31 D8 | 4D 31 F2 | REX prefix and ModR/M differ; opcode 31 unchanged |
| mov regA, imm64 | 48 B8 xx xx xx xx xx xx xx xx | 49 BA xx xx xx xx xx xx xx xx | REX prefix and opcode byte differ |
| add [regA], regB | 48 01 18 | 4D 01 32 | REX prefix and ModR/M differ; opcode 01 unchanged |
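The xor and add rows can be reproduced by hand from the register numbers. The following is a minimal sketch in plain C++; the helper name encodeModRM and the Encoding alias are made up for this illustration and do not reflect how asmjit's encoder is organized.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hardware register numbers: rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5,
// rsi=6, rdi=7, r8..r15 = 8..15.
using Encoding = std::array<std::uint8_t, 3>;

// Hypothetical helper: encode `OP r/m64, r64` (0x31 = XOR, 0x01 = ADD).
// mod = 3 selects register-direct, mod = 0 selects [rm] with no displacement.
// Not handled: with mod = 0, rm numbers 4/12 require a SIB byte and 5/13 mean
// RIP-relative/disp32, so those registers are outside this sketch.
Encoding encodeModRM(std::uint8_t opcode, unsigned mod, unsigned rm, unsigned reg) {
    // REX = 0100WRXB. W=1: 64-bit operands, R: high bit of reg, B: high bit of rm.
    const auto rex =
        static_cast<std::uint8_t>(0x48u | ((reg >> 3) << 2) | (rm >> 3));
    // ModR/M = mod(2 bits) | reg(3 bits) | rm(3 bits), low three bits of each.
    const auto modrm =
        static_cast<std::uint8_t>((mod << 6) | ((reg & 7u) << 3) | (rm & 7u));
    return {rex, opcode, modrm};
}
```

Calling encodeModRM(0x31, 3, 10, 14) corresponds to xor r10, r14: the high bits of both register numbers move into the REX prefix, and the low three bits land in ModR/M, which is why the opcode byte stays the same while the bytes around it change.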
3. Block Cipher Decoder Stub
The block cipher decoder processes encrypted data in 8-byte (QWORD) chunks, applying the inverse operation chain to each block. Here is how Shoggoth generates this stub using asmjit:
// Simplified block cipher decoder generation.
// `rng` and insertGarbageInstructions() are assumed to be defined elsewhere (see Module 7).
void generateBlockDecoder(x86::Assembler& a,
                          const std::vector<Operation>& inverseChain,
                          x86::Gp regPtr, x86::Gp regCount,
                          x86::Gp regVal, x86::Gp regKey,
                          size_t encryptedSize) {
    Label loopStart = a.newLabel();
    Label loopEnd = a.newLabel();
    Label dataLabel = a.newLabel();
    // Locate encrypted data via RIP-relative LEA. asmjit takes the label itself
    // as the memory base and emits [rip + disp32] for it in 64-bit mode.
    a.lea(regPtr, x86::ptr(dataLabel));
    // Set block count
    size_t blockCount = encryptedSize / 8;
    a.mov(regCount, blockCount);
    // === Decryption Loop ===
    a.bind(loopStart);
    // Load current 8-byte block
    a.mov(regVal, x86::qword_ptr(regPtr));
    // Apply inverse operations (reversed order from encryption)
    for (const auto& op : inverseChain) {
        // Insert junk code between operations (see Module 7)
        insertGarbageInstructions(a, rng);
        switch (op.type) {
            case OP_XOR:
                a.mov(regKey, op.key); // load key as imm64
                a.xor_(regVal, regKey);
                break;
            case OP_SUB: // inverse of ADD
                a.mov(regKey, op.key);
                a.sub(regVal, regKey);
                break;
            case OP_ADD: // inverse of SUB
                a.mov(regKey, op.key);
                a.add(regVal, regKey);
                break;
            case OP_ROR: // inverse of ROL
                a.ror(regVal, (int)op.key);
                break;
            case OP_ROL: // inverse of ROR
                a.rol(regVal, (int)op.key);
                break;
            case OP_NOT: // self-inverse
                a.not_(regVal);
                break;
            case OP_NEG: // self-inverse
                a.neg(regVal);
                break;
            case OP_DEC: // inverse of INC
                a.dec(regVal);
                break;
            case OP_INC: // inverse of DEC
                a.inc(regVal);
                break;
        }
    }
    // Store decrypted block back
    a.mov(x86::qword_ptr(regPtr), regVal);
    // Advance pointer and loop
    a.add(regPtr, 8);
    a.dec(regCount);
    a.jnz(loopStart);
    a.bind(loopEnd);
    // Fall through to next stub or jump to payload
    a.bind(dataLabel);
    // Encrypted data follows here in the final output
}
Each run produces different code because: (1) the registers regPtr, regCount, regVal, and regKey are randomly assigned, (2) the operation chain itself varies (different operations, different keys), (3) junk instructions are inserted before each real operation, and (4) the key values embedded as immediate operands are different.
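The decoder only works if the inverse chain undoes the encryption chain exactly, which means inverting each operation and reversing the order. The sketch below models that on a single 64-bit block in plain C++. The type and function names (OpType, Op, inverseChain, applyChain) are hypothetical stand-ins for this illustration, not Shoggoth's actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class OpType { Xor, Add, Sub, Rol, Ror, Not, Neg, Inc, Dec };

struct Op {
    OpType type;
    std::uint64_t key;  // XOR/ADD/SUB operand or rotate count; unused otherwise
};

std::uint64_t rotl(std::uint64_t v, unsigned n) {
    n &= 63;
    return n ? (v << n) | (v >> (64 - n)) : v;
}

std::uint64_t rotr(std::uint64_t v, unsigned n) {
    n &= 63;
    return n ? (v >> n) | (v << (64 - n)) : v;
}

// Apply one operation to a 64-bit block (arithmetic wraps modulo 2^64).
std::uint64_t applyOp(std::uint64_t v, const Op& op) {
    switch (op.type) {
        case OpType::Xor: return v ^ op.key;
        case OpType::Add: return v + op.key;
        case OpType::Sub: return v - op.key;
        case OpType::Rol: return rotl(v, static_cast<unsigned>(op.key));
        case OpType::Ror: return rotr(v, static_cast<unsigned>(op.key));
        case OpType::Not: return ~v;
        case OpType::Neg: return 0 - v;
        case OpType::Inc: return v + 1;
        case OpType::Dec: return v - 1;
    }
    return v;
}

// Map each operation to the one that undoes it.
Op inverseOf(const Op& op) {
    switch (op.type) {
        case OpType::Add: return Op{OpType::Sub, op.key};
        case OpType::Sub: return Op{OpType::Add, op.key};
        case OpType::Rol: return Op{OpType::Ror, op.key};
        case OpType::Ror: return Op{OpType::Rol, op.key};
        case OpType::Inc: return Op{OpType::Dec, 0};
        case OpType::Dec: return Op{OpType::Inc, 0};
        default:          return op;  // Xor, Not, and Neg are self-inverse
    }
}

// Invert every operation and reverse the order.
std::vector<Op> inverseChain(const std::vector<Op>& chain) {
    std::vector<Op> inv;
    for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
        inv.push_back(inverseOf(*it));
    }
    return inv;
}

std::uint64_t applyChain(std::uint64_t v, const std::vector<Op>& chain) {
    for (const Op& op : chain) v = applyOp(v, op);
    return v;
}
```

A host-side model like this is a convenient way to check a randomly generated chain before any machine code is emitted: for every block value v, applyChain(applyChain(v, chain), inverseChain(chain)) should return v.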
4. RC4 Decoder Stub
The RC4 decoder is more complex because it must implement both the KSA (Key Scheduling Algorithm) and PRGA (Pseudo-Random Generation Algorithm) in x86-64 assembly. The stub needs a 256-byte S-box array, which it allocates on the stack:
// Simplified RC4 decoder generation with asmjit
void generateRC4Decoder(x86::Assembler& a,
                        x86::Gp regI, x86::Gp regJ,
                        x86::Gp regN, x86::Gp regTemp,
                        x86::Gp regData, x86::Gp regKeyPtr,
                        size_t keyLen, size_t dataLen) {
    Label ksaLoop = a.newLabel();
    Label prgaLoop = a.newLabel();
    Label keyData = a.newLabel();
    // Allocate 256-byte S-box on stack
    a.sub(x86::rsp, 256);
    // regTemp holds the S-box base address (the stack allocation)
    a.mov(regTemp, x86::rsp);
    // === KSA: Initialize S[i] = i ===
    Label initLoop = a.newLabel();
    a.xor_(regI, regI); // i = 0
    a.bind(initLoop);
    a.mov(x86::byte_ptr(regTemp, regI), regI.r8Lo());
    a.inc(regI);
    a.cmp(regI, 256);
    a.jne(initLoop);
    // === KSA: Permute S using key ===
    // Label-based operand; asmjit emits [rip + disp32] for it in 64-bit mode
    a.lea(regKeyPtr, x86::ptr(keyData));
    a.xor_(regI, regI); // i = 0
    a.xor_(regJ, regJ); // j = 0
    a.bind(ksaLoop);
    // j = (j + S[i] + key[i % keyLen]) & 0xFF
    // ... (KSA permutation logic)
    // Swap S[i] and S[j]
    a.inc(regI);
    a.cmp(regI, 256);
    a.jne(ksaLoop);
    // === PRGA: Generate keystream and XOR with data ===
    a.lea(regData, x86::ptr(x86::rip, /* offset to encrypted data */));
    a.xor_(regI, regI);
    a.xor_(regJ, regJ);
    a.mov(regN, dataLen);
    a.bind(prgaLoop);
    // i = (i + 1) & 0xFF
    // j = (j + S[i]) & 0xFF
    // Swap S[i], S[j]
    // k = S[(S[i] + S[j]) & 0xFF]
    // data[n] ^= k
    a.dec(regN);
    a.jnz(prgaLoop);
    // Restore stack and continue
    a.add(x86::rsp, 256);
    // Note: as sketched, execution would run straight into the key bytes;
    // a complete stub has to branch past them to reach the payload.
    a.bind(keyData);
    // RC4 key bytes follow here
}
Stack Usage in PIC Code
The RC4 decoder allocates the 256-byte S-box on the stack. This is safe in PIC code because the stack pointer is always valid regardless of where the code is loaded. However, the stub must carefully balance sub rsp and add rsp to avoid corrupting the stack frame. Shoggoth ensures the RSP adjustment is always correctly matched.
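For reference, the KSA and PRGA steps that the listing leaves as comments are the textbook RC4 algorithm. The version below is a plain C++ sketch of that algorithm, not the assembly Shoggoth generates; the function name rc4 and its signature are chosen for this illustration. The local array S plays the role of the 256-byte stack allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Textbook RC4. Encryption and decryption are the same operation.
// `key` must not be empty.
std::vector<std::uint8_t> rc4(const std::vector<std::uint8_t>& key,
                              std::vector<std::uint8_t> data) {
    std::array<std::uint8_t, 256> S{};

    // KSA, part 1: S[i] = i
    for (int i = 0; i < 256; ++i) {
        S[i] = static_cast<std::uint8_t>(i);
    }

    // KSA, part 2: j = (j + S[i] + key[i % keyLen]) & 0xFF, then swap.
    // Storing j in a uint8_t performs the & 0xFF implicitly.
    std::uint8_t j = 0;
    for (int i = 0; i < 256; ++i) {
        j = static_cast<std::uint8_t>(j + S[i] + key[i % key.size()]);
        std::swap(S[i], S[j]);
    }

    // PRGA: generate one keystream byte per data byte and XOR it in.
    std::uint8_t i = 0;
    j = 0;
    for (std::uint8_t& byte : data) {
        i = static_cast<std::uint8_t>(i + 1);
        j = static_cast<std::uint8_t>(j + S[i]);
        std::swap(S[i], S[j]);
        byte ^= S[static_cast<std::uint8_t>(S[i] + S[j])];
    }
    return data;
}
```

Because RC4 is symmetric, running rc4 with the same key over its own output returns the original bytes. That makes a reference like this useful for checking on the host that a generated stub decodes to the expected data.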
5. RIP-Relative Data Access
Both decoder stubs must locate their data (encryption keys, the encrypted payload) without using absolute addresses. Shoggoth uses two PIC techniques:
5.1 LEA with RIP-Relative Offset
The standard x86-64 approach — lea reg, [rip + offset] computes the address of data relative to the current instruction pointer:
; RIP-relative addressing to locate key data
lea rsi, [rip + key_data] ; rsi = address of key_data
; ...use rsi to read key bytes...
key_data:
db 0xDE, 0xAD, 0xBE, 0xEF ; embedded key
5.2 CALL/POP Technique
An alternative PIC technique sometimes used: CALL pushes the return address (next instruction) onto the stack, then POP retrieves it into a register. This gives the address of the code itself, from which data offsets can be calculated:
; CALL/POP technique for PIC addressing
call get_rip ; push address of 'pop rsi' onto stack
get_rip:
pop rsi ; rsi = address of this instruction
add rsi, offset ; adjust to point at data section
asmjit’s label system handles the RIP-relative approach transparently — you bind a label to the data location and reference it in a lea instruction, and asmjit computes the correct displacement automatically.
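The displacement an assembler stores for a RIP-relative operand is simple arithmetic: the target offset minus the offset of the next instruction. The sketch below works through that in plain C++ with hypothetical helper names (ripDisplacement, resolveTarget). It models the calculation only and is not asmjit's relocation code.

```cpp
#include <cassert>
#include <cstdint>

// disp32 stored in the instruction. When the CPU applies it, RIP already points
// at the next instruction, hence instrOffset + instrLength. Offsets are measured
// from the start of the generated blob.
std::int32_t ripDisplacement(std::uint64_t instrOffset,
                             std::uint64_t instrLength,
                             std::uint64_t targetOffset) {
    return static_cast<std::int32_t>(
        static_cast<std::int64_t>(targetOffset) -
        static_cast<std::int64_t>(instrOffset + instrLength));
}

// Address computed at run time when the blob is loaded at loadBase.
std::uint64_t resolveTarget(std::uint64_t loadBase,
                            std::uint64_t instrOffset,
                            std::uint64_t instrLength,
                            std::int32_t disp) {
    const std::uint64_t nextRip = loadBase + instrOffset + instrLength;
    // Sign-extend the displacement, then add with 64-bit wraparound.
    return nextRip + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}
```

A 64-bit lea reg, [rip + disp32] is 7 bytes long (REX prefix, opcode, ModR/M, and a 4-byte displacement). For such an instruction at offset 0x10 that refers to data at offset 0x80, the stored displacement is 0x80 - 0x17 = 0x69, and it resolves to loadBase + 0x80 for any load address.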
6. Equivalent Instruction Substitution in Practice
Beyond register randomization, Shoggoth can substitute individual instructions with equivalent alternatives. Here are concrete examples used in the decoder stubs:
| Operation Needed | Option A | Option B | Option C |
|---|---|---|---|
| Zero a register | xor reg, reg (2-3 bytes) | sub reg, reg (2-3 bytes) | mov reg, 0 (7 bytes for 64-bit) |
| Copy register | mov dst, src | lea dst, [src] | push src; pop dst |
| Increment | inc reg | add reg, 1 | lea reg, [reg+1] |
| Compare to zero | test reg, reg | cmp reg, 0 | or reg, reg |
| Negate | neg reg | not reg; inc reg (two’s complement) | xor reg, -1; add reg, 1 |
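The engine randomly selects between available options for common operations, further differentiating the output. The options agree on the value left in the destination register but not always on side effects: inc leaves the carry flag untouched while add updates it, lea and mov modify no flags at all, and push/pop briefly uses the stack. A substitution is therefore safe only where the following code does not depend on the difference. Combined with register randomization, even simple setup code like zeroing a counter becomes unpredictable: one run might use xor r14, r14, another sub rbx, rbx, another mov rdi, 0.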
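The value-level equivalences in the table are easy to check on 64-bit unsigned integers. The following C++ sketch models only the result left in the register, not CPU flags or stack effects. The function names are invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Negate: three instruction sequences, one result (two's complement).
std::uint64_t negDirect(std::uint64_t x) { return 0 - x; }            // neg reg
std::uint64_t negNotInc(std::uint64_t x) { return ~x + 1; }           // not reg; inc reg
std::uint64_t negXorAdd(std::uint64_t x) { return (x ^ ~0ULL) + 1; }  // xor reg, -1; add reg, 1

// Zero a register: the result does not depend on the previous contents.
std::uint64_t zeroXor(std::uint64_t x) { return x ^ x; }  // xor reg, reg
std::uint64_t zeroSub(std::uint64_t x) { return x - x; }  // sub reg, reg

// Returns true when every variant agrees for the given input.
bool variantsAgree(std::uint64_t x) {
    return negDirect(x) == negNotInc(x) &&
           negDirect(x) == negXorAdd(x) &&
           zeroXor(x) == 0 && zeroSub(x) == 0;
}
```

The xor reg, -1 form works because XOR with all ones flips every bit, which is the same as not; adding 1 then completes the two's complement negation.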
7. Putting It All Together
The complete decoder stub generation for a typical two-stage encryption produces output structured like this:
Generated Output Byte Layout
Offset 0x0000: [Junk Instructions - variable size]
Offset 0x00XX: [Block Cipher Decoder - variable size]
- RIP-relative LEA to locate data
- Block decryption loop with random ops
- Junk between each operation
Offset 0x0YYY: [Junk Instructions - variable size]
Offset 0x0ZZZ: [RC4 Decoder - variable size]
- Stack-allocated 256-byte S-box
- KSA loop with key from embedded data
- PRGA loop XORing with payload
- Stack cleanup
Offset 0x0AAA: [RC4 Key Data - variable size]
Offset 0x0BBB: [Encrypted Payload - fixed size]
Every field marked “variable size” changes between generations. Even the encrypted payload changes because the keys are different. The total output size varies between runs, which is itself a polymorphic property — file size cannot be used as a reliable indicator.
Knowledge Check
Q1: Why does changing the register assignment change the machine code bytes?
Q2: How does the RC4 decoder stub handle the 256-byte S-box allocation?
The stub allocates the S-box on the stack with sub rsp, 256. This is the correct PIC approach: the stack is always available regardless of where the code is loaded, and it avoids the need for API calls (like VirtualAlloc) that would require import resolution. The stub restores RSP when done.
Q3: What is the purpose of the lea reg, [rip + label] pattern in the decoder stubs?
The lea reg, [rip + offset] instruction computes the address of a data location relative to the current instruction pointer. Since the offset between the code and its data is fixed at generation time, this works regardless of where the code is loaded in memory, which is essential for position-independent code.