Difficulty: Intermediate

Module 6: Decoder Stub Generation

How asmjit emits polymorphic x86-64 decoder stubs with register randomization, operation-specific code generation, and position-independent data access.

Module Objective

Understand how Shoggoth uses asmjit to generate the RC4 decoder stub and the block cipher decoder stub, how register randomization is implemented at the code generation level, how each operation in the block cipher chain maps to x86-64 instructions, and how the stubs locate their encrypted data using RIP-relative addressing.

1. Stub Generation Overview

Shoggoth generates two decoder stubs, one per encryption stage. Each stub is built using a fresh CodeHolder and x86::Assembler instance, with randomly selected registers and junk code interleaved. The generation process follows a consistent pattern:

Stub Generation Pipeline

Random Register Assignment -> Emit Prologue (save registers, locate data) -> Emit Decryption Loop Body -> Emit Epilogue (restore, jump to payload) -> Extract Raw Bytes

What makes the output polymorphic is that each step involves random choices: which registers hold which values, whether the loop counts up or down, how the loop pointer advances, and where junk instructions are inserted. As a result, two invocations produce functionally equivalent but structurally different machine code.

2. Register Randomization Implementation

Before emitting any instructions, Shoggoth randomly assigns CPU registers to the roles needed by the decoder. The available pool excludes RSP, which must remain a valid stack pointer, and may also reserve RBP depending on calling-convention needs (the pool below omits both):

C++
// Register assignment for block cipher decoder
// Randomly shuffle and assign roles from the GPR pool
std::vector<x86::Gp> gprPool = {
    x86::rax, x86::rbx, x86::rcx, x86::rdx,
    x86::rsi, x86::rdi, x86::r8,  x86::r9,
    x86::r10, x86::r11, x86::r12, x86::r13,
    x86::r14, x86::r15
};

std::shuffle(gprPool.begin(), gprPool.end(), rng);

// Assign roles from shuffled pool
x86::Gp regDataPtr   = gprPool[0];  // Pointer to encrypted data
x86::Gp regBlockCount = gprPool[1]; // Number of 8-byte blocks
x86::Gp regCurrent   = gprPool[2];  // Current block value
x86::Gp regKey       = gprPool[3];  // Operation key / temp
x86::Gp regIndex     = gprPool[4];  // Block index / loop counter

Because x86-64 encodes register numbers in the REX prefix and the ModR/M byte, different register assignments produce different instruction encodings. For example, xor rax, rbx encodes as 48 31 D8 while xor r10, r14 encodes as 4D 31 F2: the opcode byte (31) is unchanged, but the REX prefix and the ModR/M byte both differ for the same logical operation.

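Instruction | Example Encoding (RAX, RBX) | Example Encoding (R10, R14) | Bytes Changed
----------- | --------------------------- | --------------------------- | -------------
xor regA, regB | 48 31 D8 | 4D 31 F2 | REX prefix and ModR/M differ; opcode 31 is unchanged
mov regA, imm64 | 48 B8 xx xx xx xx xx xx xx xx | 49 BA xx xx xx xx xx xx xx xx | REX prefix and opcode byte differ (the register is encoded in the opcode)
add [regA], regB | 48 01 18 | 4D 01 32 | REX prefix and ModR/M differ; opcode 01 is unchanged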
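
To make the table concrete, the following is a minimal sketch of how the REX prefix and ModR/M byte are derived from register numbers for a 64-bit register-to-register instruction. The helper name encodeRegReg is hypothetical and is not part of Shoggoth or asmjit; asmjit performs this encoding internally. Register numbers follow the hardware numbering (0 = RAX, 3 = RBX, 10 = R10, 14 = R14).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only): encodes "op r/m64, r64" in
// register-direct form. `opcode` is the one-byte opcode (0x31 = XOR,
// 0x01 = ADD); `rm` is the destination register number and `reg` the
// source register number, both in the range 0..15.
std::array<std::uint8_t, 3> encodeRegReg(std::uint8_t opcode, unsigned rm, unsigned reg) {
    assert(rm < 16 && reg < 16);
    // REX = 0100WRXB: W=1 selects 64-bit operands, R extends `reg`, B extends `rm`.
    const std::uint8_t rex =
        static_cast<std::uint8_t>(0x48 | ((reg >> 3) << 2) | (rm >> 3));
    // ModR/M = mod (11 = register direct) | low 3 bits of reg | low 3 bits of rm.
    const std::uint8_t modrm =
        static_cast<std::uint8_t>(0xC0 | ((reg & 7) << 3) | (rm & 7));
    return {rex, opcode, modrm};
}
```

Calling encodeRegReg(0x31, 0, 3) yields 48 31 D8 (xor rax, rbx) and encodeRegReg(0x31, 10, 14) yields 4D 31 F2 (xor r10, r14), matching the first row of the table: only the first and last bytes depend on the register choice.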

3. Block Cipher Decoder Stub

The block cipher decoder processes encrypted data in 8-byte (QWORD) chunks, applying the inverse operation chain to each block. Here is how Shoggoth generates this stub using asmjit:

C++
// Simplified block cipher decoder generation
void generateBlockDecoder(x86::Assembler& a,
                          const std::vector<Operation>& inverseChain,
                          x86::Gp regPtr, x86::Gp regCount,
                          x86::Gp regVal, x86::Gp regKey,
                          size_t encryptedSize) {
    Label loopStart = a.newLabel();
    Label loopEnd   = a.newLabel();
    Label dataLabel = a.newLabel();

    // Locate encrypted data via RIP-relative LEA
    a.lea(regPtr, x86::ptr(x86::rip, dataLabel));

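    // Set block count. Assumes encryptedSize is a non-zero multiple of 8
    // (i.e. the encoder pads the payload); a count of 0 would wrap in dec/jnz.
    size_t blockCount = encryptedSize / 8;
    a.mov(regCount, blockCount);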

    // === Decryption Loop ===
    a.bind(loopStart);

    // Load current 8-byte block
    a.mov(regVal, x86::qword_ptr(regPtr));

    // Apply inverse operations (reversed order from encryption)
    for (const auto& op : inverseChain) {
        // Insert junk code between operations (see Module 7).
        // rng is assumed to be a random engine visible in this scope.
        insertGarbageInstructions(a, rng);

        switch (op.type) {
            case OP_XOR:
                a.mov(regKey, op.key);    // load key as imm64
                a.xor_(regVal, regKey);
                break;
            case OP_SUB:  // inverse of ADD
                a.mov(regKey, op.key);
                a.sub(regVal, regKey);
                break;
            case OP_ADD:  // inverse of SUB
                a.mov(regKey, op.key);
                a.add(regVal, regKey);
                break;
            case OP_ROR:  // inverse of ROL
                a.ror(regVal, (int)op.key);
                break;
            case OP_ROL:  // inverse of ROR
                a.rol(regVal, (int)op.key);
                break;
            case OP_NOT:
                a.not_(regVal);
                break;
            case OP_NEG:
                a.neg(regVal);
                break;
            case OP_DEC:  // inverse of INC
                a.dec(regVal);
                break;
            case OP_INC:  // inverse of DEC
                a.inc(regVal);
                break;
        }
    }

    // Store decrypted block back
    a.mov(x86::qword_ptr(regPtr), regVal);

    // Advance pointer and loop
    a.add(regPtr, 8);
    a.dec(regCount);
    a.jnz(loopStart);

    a.bind(loopEnd);
    // Fall through to next stub or jump to payload

    a.bind(dataLabel);
    // Encrypted data follows here in the final output
}

Each run produces different code because: (1) the registers regPtr, regCount, regVal, regKey are randomly assigned, (2) the operation chain itself varies (different operations, different keys), (3) junk instructions are inserted between each real operation, and (4) the key values embedded as immediate operands are different.
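
The decoder only works if the inverse chain undoes the encoder's chain in reverse order. The sketch below checks that property on plain 64-bit integers, without any code generation. The Operation struct, the enum, and the helper names are hypothetical stand-ins that mirror the names in the listing above; they are not taken from Shoggoth's source.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the listing's types (illustration only).
enum OpType { OP_XOR, OP_ADD, OP_SUB, OP_ROL, OP_ROR, OP_NOT, OP_NEG, OP_INC, OP_DEC };
struct Operation { OpType type; std::uint64_t key; };

static std::uint64_t rotl64(std::uint64_t v, unsigned n) {
    n &= 63;
    return n ? (v << n) | (v >> (64 - n)) : v;
}
static std::uint64_t rotr64(std::uint64_t v, unsigned n) {
    n &= 63;
    return n ? (v >> n) | (v << (64 - n)) : v;
}

// Applies one operation to a 64-bit block; arithmetic wraps modulo 2^64.
std::uint64_t applyOp(const Operation& op, std::uint64_t v) {
    switch (op.type) {
        case OP_XOR: return v ^ op.key;
        case OP_ADD: return v + op.key;
        case OP_SUB: return v - op.key;
        case OP_ROL: return rotl64(v, static_cast<unsigned>(op.key));
        case OP_ROR: return rotr64(v, static_cast<unsigned>(op.key));
        case OP_NOT: return ~v;
        case OP_NEG: return ~v + 1;
        case OP_INC: return v + 1;
        case OP_DEC: return v - 1;
    }
    return v;
}

// Maps an operation to the one that undoes it; XOR, NOT and NEG are self-inverse.
Operation inverseOf(const Operation& op) {
    switch (op.type) {
        case OP_ADD: return {OP_SUB, op.key};
        case OP_SUB: return {OP_ADD, op.key};
        case OP_ROL: return {OP_ROR, op.key};
        case OP_ROR: return {OP_ROL, op.key};
        case OP_INC: return {OP_DEC, op.key};
        case OP_DEC: return {OP_INC, op.key};
        default:     return op;
    }
}

// Builds the decoder chain: each inverse, in the reverse of encryption order.
std::vector<Operation> buildInverseChain(const std::vector<Operation>& chain) {
    std::vector<Operation> inv;
    for (auto it = chain.rbegin(); it != chain.rend(); ++it) inv.push_back(inverseOf(*it));
    return inv;
}

std::uint64_t applyChain(const std::vector<Operation>& chain, std::uint64_t v) {
    for (const auto& op : chain) v = applyOp(op, v);
    return v;
}
```

For any chain and any block value v, applyChain(buildInverseChain(chain), applyChain(chain, v)) returns v. Applying the inverses without reversing the order generally does not, because operations such as ADD and ROL do not commute.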

4. RC4 Decoder Stub

The RC4 decoder is more complex because it must implement both the KSA (Key Scheduling Algorithm) and PRGA (Pseudo-Random Generation Algorithm) in x86-64 assembly. The stub needs a 256-byte S-box array, which it allocates on the stack:

C++
// Simplified RC4 decoder generation with asmjit
void generateRC4Decoder(x86::Assembler& a,
                        x86::Gp regI, x86::Gp regJ,
                        x86::Gp regN, x86::Gp regTemp,
                        x86::Gp regData, x86::Gp regKeyPtr,
                        size_t keyLen, size_t dataLen) {
    Label ksaLoop = a.newLabel();
    Label prgaLoop = a.newLabel();
    Label keyData  = a.newLabel();

    // Allocate 256-byte S-box on stack
    a.sub(x86::rsp, 256);
    // regTemp serves as the S-box base pointer from here on
    a.mov(regTemp, x86::rsp);

    // === KSA: Initialize S[i] = i ===
    Label initLoop = a.newLabel();
    a.xor_(regI, regI);            // i = 0
    a.bind(initLoop);
    a.mov(x86::byte_ptr(regTemp, regI), regI.r8Lo());
    a.inc(regI);
    a.cmp(regI, 256);
    a.jne(initLoop);

    // === KSA: Permute S using key ===
    a.lea(regKeyPtr, x86::ptr(x86::rip, keyData));
    a.xor_(regI, regI);            // i = 0
    a.xor_(regJ, regJ);            // j = 0
    a.bind(ksaLoop);
    // j = (j + S[i] + key[i % keyLen]) & 0xFF
    // ... (KSA permutation logic)
    // Swap S[i] and S[j]
    a.inc(regI);
    a.cmp(regI, 256);
    a.jne(ksaLoop);

    // === PRGA: Generate keystream and XOR with data ===
    a.lea(regData, x86::ptr(x86::rip, /* offset to encrypted data */));
    a.xor_(regI, regI);
    a.xor_(regJ, regJ);
    a.mov(regN, dataLen);
    a.bind(prgaLoop);
    // i = (i + 1) & 0xFF
    // j = (j + S[i]) & 0xFF
    // Swap S[i], S[j]
    // k = S[(S[i] + S[j]) & 0xFF]
    // data[n] ^= k
    a.dec(regN);
    a.jnz(prgaLoop);

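    // Restore stack and continue
    a.add(x86::rsp, 256);
    // NOTE: a jump over the key bytes to the decrypted payload belongs here.
    // It is omitted in this simplified listing; without it, execution would
    // fall through into the key data.

    a.bind(keyData);
    // RC4 key bytes follow here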
}

Stack Usage in PIC Code

The RC4 decoder allocates the 256-byte S-box on the stack. This is safe in PIC code because the stack pointer is always valid regardless of where the code is loaded. However, the stub must carefully balance sub rsp and add rsp to avoid corrupting the stack frame. Shoggoth ensures the RSP adjustment is always correctly matched.
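
The KSA and PRGA steps that the listing above leaves as comments are the standard RC4 algorithm. As a reference for what the generated stub has to compute, here is a portable C++ version. This is a sketch of textbook RC4, not code from Shoggoth, and the function name rc4 is chosen for illustration. Index arithmetic uses std::uint8_t so that the "& 0xFF" reductions happen through natural wraparound.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Textbook RC4: returns `data` XORed with the keystream derived from `key`.
// Encryption and decryption are the same operation.
std::vector<std::uint8_t> rc4(const std::vector<std::uint8_t>& key,
                              std::vector<std::uint8_t> data) {
    if (key.empty()) return data;  // KSA is undefined for an empty key

    // KSA, part 1: S[i] = i
    std::uint8_t S[256];
    for (int i = 0; i < 256; ++i) S[i] = static_cast<std::uint8_t>(i);

    // KSA, part 2: j = (j + S[i] + key[i % keyLen]) & 0xFF; swap S[i], S[j]
    std::uint8_t j = 0;
    for (int i = 0; i < 256; ++i) {
        j = static_cast<std::uint8_t>(j + S[i] + key[i % key.size()]);
        std::swap(S[i], S[j]);
    }

    // PRGA: generate one keystream byte per data byte and XOR it in.
    std::uint8_t i = 0;
    j = 0;
    for (std::size_t n = 0; n < data.size(); ++n) {
        i = static_cast<std::uint8_t>(i + 1);
        j = static_cast<std::uint8_t>(j + S[i]);
        std::swap(S[i], S[j]);
        data[n] ^= S[static_cast<std::uint8_t>(S[i] + S[j])];
    }
    return data;
}
```

With the ASCII key "Key" and the ASCII input "Plaintext", this produces the widely published test vector BB F3 16 E8 D9 40 AF 0A D3; running the output through rc4 again with the same key restores the input.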

5. RIP-Relative Data Access

Both decoder stubs must locate their data (encryption keys, the encrypted payload) without using absolute addresses. Shoggoth uses two PIC techniques:

5.1 LEA with RIP-Relative Offset

The standard x86-64 approach — lea reg, [rip + offset] computes the address of data relative to the current instruction pointer:

ASM
; RIP-relative addressing to locate key data
    lea  rsi, [rip + key_data]    ; rsi = address of key_data
    ; ...use rsi to read key bytes...

key_data:
    db 0xDE, 0xAD, 0xBE, 0xEF    ; embedded key

5.2 CALL/POP Technique

An alternative PIC technique sometimes used: CALL pushes the return address (next instruction) onto the stack, then POP retrieves it into a register. This gives the address of the code itself, from which data offsets can be calculated:

ASM
; CALL/POP technique for PIC addressing
    call get_rip          ; push address of 'pop rsi' onto stack
get_rip:
    pop  rsi              ; rsi = address of this instruction
    add  rsi, offset      ; adjust to point at data section

asmjit’s label system handles the RIP-relative approach transparently — you bind a label to the data location and reference it in a lea instruction, and asmjit computes the correct displacement automatically.
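
The displacement that gets computed for a RIP-relative lea can be worked out by hand: it is the distance from the end of the lea instruction to the target, because RIP already points at the next instruction when the displacement is applied. The sketch below shows that arithmetic together with the resulting 7-byte encoding. The helper encodeLeaRip and its offset parameters are hypothetical and only illustrate the calculation; they are not asmjit API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only): encodes "lea reg64, [rip + disp32]"
// for an instruction placed at byte offset `instrOffset` in the output buffer
// that should yield the address of byte offset `targetOffset`.
std::array<std::uint8_t, 7> encodeLeaRip(unsigned reg,
                                         std::uint32_t instrOffset,
                                         std::uint32_t targetOffset) {
    assert(reg < 16);
    constexpr std::uint32_t kInstrLen = 7;  // REX + opcode + ModR/M + disp32
    // Displacement is relative to the NEXT instruction; unsigned wraparound
    // yields the two's-complement form for backward references.
    const std::uint32_t disp = targetOffset - (instrOffset + kInstrLen);
    // REX.W = 1 for a 64-bit destination; REX.R extends the reg field.
    const std::uint8_t rex = static_cast<std::uint8_t>(0x48 | ((reg >> 3) << 2));
    // mod = 00 with r/m = 101 selects RIP-relative addressing in 64-bit mode.
    const std::uint8_t modrm = static_cast<std::uint8_t>(((reg & 7) << 3) | 0x05);
    return {rex, 0x8D, modrm,
            static_cast<std::uint8_t>(disp & 0xFF),
            static_cast<std::uint8_t>((disp >> 8) & 0xFF),
            static_cast<std::uint8_t>((disp >> 16) & 0xFF),
            static_cast<std::uint8_t>((disp >> 24) & 0xFF)};
}
```

For example, a lea rsi placed at offset 0 that targets data at offset 11 encodes as 48 8D 35 04 00 00 00: the instruction ends at offset 7, so the displacement is 4. Because the displacement depends only on the distance between code and data, the bytes stay valid wherever the buffer is loaded.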

6. Equivalent Instruction Substitution in Practice

Beyond register randomization, Shoggoth can substitute individual instructions with equivalent alternatives. Here are concrete examples used in the decoder stubs:

Operation Needed | Option A | Option B | Option C
---------------- | -------- | -------- | --------
Zero a register | xor reg, reg (2-3 bytes) | sub reg, reg (2-3 bytes) | mov reg, 0 (7 bytes for 64-bit)
Copy register | mov dst, src | lea dst, [src] | push src; pop dst
Increment | inc reg | add reg, 1 | lea reg, [reg+1]
Compare to zero | test reg, reg | cmp reg, 0 | or reg, reg
Negate | neg reg | not reg; inc reg (two’s complement) | xor reg, -1; add reg, 1

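The engine randomly selects between available options for common operations, further differentiating the output. The alternatives leave the same value in the register but do not always set the same flags: mov and lea leave the flags untouched, and inc and dec preserve CF, so a substitution is only safe where no later instruction depends on the affected flags (the dec/jnz pair that closes the decryption loop, for instance, relies on ZF). Combined with register randomization, even simple setup code like zeroing a counter becomes unpredictable: one run might use xor r14, r14, another sub rbx, rbx, another mov rdi, 0.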
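
The value-level equivalence of the "Zero a register" and "Negate" rows can be checked with ordinary 64-bit arithmetic. The sketch below does so at compile time; the function names are made up for illustration, and the check covers only the resulting register value, not the flags each instruction sequence would set.

```cpp
#include <cstdint>

// Three ways to negate, mirroring the "Negate" row of the table.
constexpr std::uint64_t negDirect(std::uint64_t x) { return 0 - x; }   // neg reg
constexpr std::uint64_t negNotInc(std::uint64_t x) { return ~x + 1; }  // not reg; inc reg
constexpr std::uint64_t negXorAdd(std::uint64_t x) {                   // xor reg, -1; add reg, 1
    return (x ^ ~std::uint64_t{0}) + 1;
}

// Two ways to zero, mirroring the "Zero a register" row.
constexpr std::uint64_t zeroXor(std::uint64_t x) { return x ^ x; }     // xor reg, reg
constexpr std::uint64_t zeroSub(std::uint64_t x) { return x - x; }     // sub reg, reg

// True when every alternative produces the same value for input x.
constexpr bool agree(std::uint64_t x) {
    return negDirect(x) == negNotInc(x) && negNotInc(x) == negXorAdd(x) &&
           zeroXor(x) == 0 && zeroSub(x) == 0;
}

static_assert(agree(0) && agree(1) && agree(0x8000000000000000ULL) &&
                  agree(0xFFFFFFFFFFFFFFFFULL),
              "substitutions must agree on the resulting value");
```

The static_assert covers the edge cases (zero, one, the sign bit alone, and all ones); agree can also be called at run time on arbitrary values.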

7. Putting It All Together

The complete decoder stub generation for a typical two-stage encryption produces output structured like this:

Generated Output Byte Layout

Layout
Offset 0x0000: [Junk Instructions - variable size]
Offset 0x00XX: [Block Cipher Decoder - variable size]
               - RIP-relative LEA to locate data
               - Block decryption loop with random ops
               - Junk between each operation
Offset 0x0YYY: [Junk Instructions - variable size]
Offset 0x0ZZZ: [RC4 Decoder - variable size]
               - Stack-allocated 256-byte S-box
               - KSA loop with key from embedded data
               - PRGA loop XORing with payload
               - Stack cleanup
Offset 0x0AAA: [RC4 Key Data - variable size]
Offset 0x0BBB: [Encrypted Payload - fixed size]

Every field marked “variable size” changes between generations. Even the encrypted payload changes because the keys are different. The total output size varies between runs, which is itself a polymorphic property — file size cannot be used as a reliable indicator.

Knowledge Check

Q1: Why does changing the register assignment change the machine code bytes?

In x86-64 encoding, the register number is encoded in the ModR/M byte (bits 5:3 for reg, bits 2:0 for r/m), extended by the REX prefix (R, X, B bits) for registers R8-R15; a few instructions, such as mov reg, imm64, encode the register in the opcode byte itself. Changing which register an instruction uses changes these bytes, so the same logical operation yields a different byte sequence even though the base opcode often stays the same.

Q2: How does the RC4 decoder stub handle the 256-byte S-box allocation?

The RC4 decoder allocates the 256-byte S-box on the stack with sub rsp, 256. This is the correct PIC approach — the stack is always available regardless of where the code is loaded, and it avoids the need for API calls (like VirtualAlloc) that would require import resolution. The stub restores RSP when done.

Q3: What is the purpose of the lea reg, [rip + label] pattern in the decoder stubs?

The lea reg, [rip + offset] instruction computes the address of a data location relative to the current instruction pointer. Since the offset between the code and its data is fixed at generation time, this works regardless of where the code is loaded in memory — which is essential for position-independent code.