--- created_at: '2018-01-24T05:53:46.000Z' title: Redundancy of x86 Machine Code (2009) url: https://www.strchr.com/machine_code_redundancy author: akkartik points: 46 story_text: comment_text: num_comments: 12 story_id: story_title: story_url: parent_id: created_at_i: 1516773226 _tags: - story - author_akkartik - story_16220646 objectID: '16220646' year: 2009 --- [Source](https://www.strchr.com/machine_code_redundancy "Permalink to Redundancy of x86 Machine Code - strchr.com") # Redundancy of x86 Machine Code - strchr.com [strchr.com][1] _Abstract:_ One assembly language instruction can be encoded differently in machine code. Possible applications are steganography and compiler identification. Created 9 years ago by _Peter Kankowski_ Last changed 4 years ago Filed under _Assembly language and machine code_ ![Share on social sites][2] # Redundancy of x86 Machine Code There are many ways to encode the same assembly language program in machine code. The encoding of ModR/M and SIB bytes is redundant; the SHL instruction has two equivalent opcodes; the operands can be swapped for some instructions. You can use this redundancy to identify the compiler, or to hide some information inside your executable files. Some examples are listed below. ## Two encodings for register-register operations A typical instruction of x86 architecture has two opcodes. The first of them has a register as the first operand and a register or a memory location as the second one (that's abbreviated "reg, reg/mem32" in the opcode reference or "Gv, Ev" in [the opcode table][3]). The operands for the second opcode are reversed (that's abbreviated "reg/mem32, reg" or "Ev, Gv"). This makes sense: the processor must know if it copies to the memory, or from the memory. But when both operands are registers, the encoding becomes redundant: ; mod reg r/m 03C3 add eax, ebx ; 11 000 011 01D8 add eax, ebx ; 11 011 000 The opcode 03 (ADD reg, r/m) instructs the processor to take the first operand from the _reg_ field, and the second from the _r/m_ field. 000 is the encoding for eax register, and 011 means ebx register. So 03C3 can be disassembled as "add eax, ebx". The opcode 01 means "ADD r/m, reg", so the first operand will be from _r/m_ (000, which means eax), and second will be from _reg_ (011, which means ebx). This again gives us "add eax, ebx". The encodings for these instructions are redundant: _add, adc, and, xor, or, sbb, sub, cmp,_ and _mov_. Some assemblers emit 03C3 for "add eax, ebx", and some emit 01D8, so this technique [can be used to identify the compiler][4] that produced the executable file. You can also swap the base and the index register if the index is not scaled (if the scale factor is 1): ; base index C60418 05 mov byte[eax + ebx*1],5 C60403 05 mov byte[ebx + eax*1],5 ## Two distinct opcode extensions for SAL/SHL [The SAL/SHL instruction can be encoded with two opcode extensions][5]: /4 and /6 (100 and 110 in binary). [Intel's manual][6] documents only /4; [AMD's manual][7] mentions both of them (however, /6 works okay on Pentiums). This again can be used to distinguish between compilers. The TEST instruction with an immediate operand also has 2 opcode extensions (/0 and /1): F7C3 05000000 test ebx, 5 F7CB 05000000 test ebx, 5 Again, the alternative encoding is documented only in AMD's manual, but also works on Intel processors. Many disassemblers and debuggers (including OllyDbg) cannot recognize the second instruction, so you can use it in anti-debugging code. ## Alternative opcode for instructions with an immediate byte operand Some instructions (namely, _add, or, adc, sbb, and, sub, xor,_ and _cmp_) have two opcodes when used with a immediate byte operand. Here is an example: 8000 00 add byte[eax],0 8200 00 add byte[eax],0 The second opcode is invalid in 64-bit mode, so the trick can be used only in 32-bit mode. ## Size-changing tricks The techniques listed above don't change the size of instruction. If the alternative encoding can have a different size, you have a wider choice of tricks: ### Special encoding for EAX and byte immediates If the immediate operand of some instructions is in range -80..7F, it can be encoded in 1 byte instead of 4 (compilers use the opcode 83 for that). But the same instruction can be encoded in larger number of bytes with opcode 81. Also, there is a special encoding for instructions with EAX register as the first operand: 83C0 01 add eax,1 81C0 01000000 add eax,1 05 01000000 add eax,1 ### Using SIB byte when it's not needed [The SIB byte][8] specifies Scale, Index and Base values, but it also can be used to address an absolute address in memory. Here are five (valid and documented) encodings for one instruction: C605 00104000 05 mov byte[401000],5 C60425 00104000 05 mov byte[401000],5 C60465 00104000 05 mov byte[401000],5 C604A5 00104000 05 mov byte[401000],5 C604E5 00104000 05 mov byte[401000],5 The same trick works with ebp-relative addressing: C645 04 05 mov byte[ebp+4],5 C64465 04 05 mov byte[ebp+4],5 C64425 04 05 mov byte[ebp+4],5 C644A5 04 05 mov byte[ebp+4],5 C644E5 04 05 mov byte[ebp+4],5 ### Using zero offset 013E add dword[esi],edi 017E 00 add dword[esi+00],edi 01BE 00000000 add dword[esi+00000000],edi This trick is used by MSVC++ compiler to emit the NOP instructions of different length (for padding before jump targets). For example, MSVC++ generates the following code if it needs 4-byte and 6-byte padding: 8d6424 00 lea [ebx+00],ebx ; 4-byte padding 8d9b 00000000 lea [esp+00000000],esp ; 6-byte padding The first line is marked as "npad 4" in assembly listings generated by the compiler, and the second is "npad 6". The registers (ebx, esp) can be chosen from the rarely used ones to avoid false dependencies in the code. ## Usage There is **a steganographical tool**, [Hydan][9], that changes x86 instructions to their equivalents and hides your message inside executable files. Hydan uses the tricks described above (not all of them: it doesn't "know" about SAL/SHL thing). It also can reorder the independent instructions and use more high-level tricks, for example, it changes "XOR eax, eax" to "SUB eax, eax" or vice versa. Hydan is an open-source program; you should have GCC installed to compile it from sources. It works with Portable Executable (.exe) and ELF formats. The x86 architecture has a long history, and the weird encodings are often maintained for compatibility with old software. The complicated addressing modes are quite redundant themselves. So, you have a good chance to hide something "between the lines" of your code. ![Peter Kankowski][10] Peter Kankowski ## About the author Peter is the developer of [Aba Search and Replace][11], a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at [kankowski@gmail.com.][12] Created 9 years ago by _Peter Kankowski_ Last changed 4 years ago ## 11 comments Ten recent comments are shown below. [Show all comments][13] Peter Kankowski, 11 years ago Try again, it may be some temporate problems with FASM forum server. bitRAKE, 10 years ago I have often thought about writing a re-code generator which can use different models for code generation. For example, use specific encodings to make code more compressible; use only ASCII instructions (has been done before); encodings that produce data for another type of file (JPG, etc.), or creates a picture. Peter Kankowski, 10 years ago It's a nice idea, but will not be easy to implement. Do you know some code generator which produces only ASCII instructions? Would you please provide an URL to it? ac, 10 years ago At first I didn't understand why would anybody need "only ASCII instructions" and only now it occurred to me: that guy wants to make an xploit, that is, to put his code in a textbox or something similar and then somehow make it executing. Otherwise any encoder would do. Not my cup of tea. arkon, 8 years ago Another cool trick first I saw it at Ken Silverman's page IIRC. sub eax, -128 _ to add 0x80, using byte imm. add eax, -128 _ to sub 0x80, using byte imm. Enjoy Michael Rolle, 8 years ago Swapping the base and index registers CAN produce different results in rare cases. The place where this is so is something like [esp + eax] and [eax + esp], because a different default segment is used. 64 bit mode doesn't have this problem because SS and DS are always equivalent. But in 16 or 32 bit mode, if your DS and SS are different segments, then you have trouble. Peter Kankowski, 8 years ago Formally, you are right, but Win32 code runs in flat memory model, where DS and SS are the same, so it's usually not a problem. David Bakin, 4 years ago The Alsys Ada compiler for x86/DOS used these techniques (esp. the multiple encodings for reg/reg moves) to encode: a) in each subroutine prologue, whether or not there were exception handlers defined in that procedure (the handlers themselves were described in a table) b) before each subroutine call, the number of items pushed on the x87 stack at the point of call (used when unwinding the stack for exceptions) c) in the compilation unit, starting from the beginning, a "signature" that proved the code was generated by the Alsys compiler (for copyright protection) Most of the time there were existing reg/reg instructions that could be used, but if necessary, a nop reg/reg instruction (e.g., mov ax,ax, which also, of course, has two encodings) would be inserted. Peter Kankowski, 4 years ago Thank you, it's an interesting example. b4 4c cd 21, 4 years ago The x86 instruction set may seem weird, but if you look at the encodings the redundancy makes sense :) 80 xx ib : alu_op eb, ib 81 xx iw : alu_op ew, iw 82 xx ib : alu_op eb, ib 83 xx ib : alu_op ew, ib (sign extended) so bit 0 encodes whether the destination is a byte or (d)word, and bit 1 means the immediate operand is a signed byte, which of course makes no difference if both operands are bytes. Your name: Comment: Enter verification code: ![Verification code][14] Please ignore this field: [strchr.com][1] Perfectionistic and minimalistic programming. ### Featured pages [Discussion: the first language][15] Which programming language to learn first? [Hash functions: An empirical comparison][16] Benchmark program for hash tables and comparison of 15 popular hash functions. [Recommended books and sites][17] The minimal reading list to become a good programmer. [Software interface design tips][18] How to design easy-to-use interfaces between modules of your program. [Implementing strcmp, strlen, and strstr using SSE 4.2 instructions][19] Using new Intel Core i7 instructions to speed up string manipulation. [Table-driven approach][20] How to make you code shorter and easier to maintain by using arrays. [x86 Machine Code Statistics][21] Which instructions and addressing modes are used most often. What is the average instruction length. ### Recent comments [Peter Kankowski:][22] Vengir, it returns plural2 for 21, 31, 41, etc. For Russian, modulo = 10 is specified for singular, hence the difference. [Vengir:][23] The way I understand the Polish algorithm, it returns singular for numbers 21, 31, 41, etc... [Ostap:][24] Beautiful ! Thanx Peter 32bit implementation was exactly what i were looking for .. Works better than native VC12 [Jessica:][25] Hi Peter... [Jessica:][26] Hi peter I am new to the x86 and was wondering If you could give me some advice on how to calculate and run butchers algorithm... (c) Peter Kankowski, 2006—2018\. Some rights reserved. [1]: https://www.strchr.com/ [2]: https://www.strchr.com/ico/social_256.png [3]: http://www.sandpile.org/x86/opc_1.htm [4]: http://board.flatassembler.net/topic.php?t=3866#27977 [5]: http://board.flatassembler.net/topic.php?t=1238&start=25#8286 [6]: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html [7]: http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/ [8]: http://www.sandpile.org/x86/opc_sib.htm [9]: http://www.crazyboy.com/hydan/ [10]: https://www.strchr.com/media/avatars/1.jpg [11]: http://www.abareplace.com [12]: mailto:kankowski%40gmail.com [13]: https://www.strchr.com/machine_code_redundancy?allcomments=1#comments [14]: https://www.strchr.com/kcaptcha.php "" [15]: https://www.strchr.com/first_language [16]: https://www.strchr.com/hash_functions [17]: https://www.strchr.com/links [18]: https://www.strchr.com/software_interface [19]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 [20]: https://www.strchr.com/table-driven [21]: https://www.strchr.com/x86_machine_code_statistics [22]: https://www.strchr.com/plural_forms#comment_827 [23]: https://www.strchr.com/plural_forms#comment_826 [24]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2#comment_825 [25]: https://www.strchr.com/calendar#comment_824 [26]: https://www.strchr.com/calendar#comment_823