Assembly and Disassembly

What is machine language? What is assembly language? Obviously computer engineers need to know. Why should average programmers know about them?

Machine Language

Here is the machine language for a chunk of code (for an IA-32 processor) that takes (from the stack) a single 32-bit integer argument — let's call it n — and returns through eax the value 3n+1 if n is even and 4n-3 if n is odd.

1000101101001100001001000000010010001011110000011001100100110011
1100001000101011110000101000001111100000000000010011001111000010
0010101111000010100011010100010001001001000000010111010000000111
1000110100000100100011011111110111111111111111111111111111000011

Binary is too hard to read. Let's use hex:

8b 4c 24 04 8b c1 99 33 c2 2b c2 83 e0 01 33 c2
2b c2 8d 44 49 01 74 07 8d 04 8d fd ff ff ff c3
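
Going from the binary listing to the hex listing is purely mechanical: each group of eight bits becomes two hex digits. Here is a minimal Python sketch of that regrouping. The function name bits_to_hex is made up for this illustration and is not part of any tool mentioned here.

```python
def bits_to_hex(bits):
    """Convert a string of '0'/'1' characters to space-separated hex bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string must be a whole number of bytes")
    # Take the bits eight at a time; each group is one byte (two hex digits).
    groups = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return " ".join(f"{int(group, 2):02x}" for group in groups)


# The first 32 bits of the binary listing.
print(bits_to_hex("10001011010011000010010000000100"))  # 8b 4c 24 04
```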

How can you tell what this machine code does? You can look at the Intel Software Developer's Manual, Volume 2, Appendix A ("Opcode Map"), or any of several online sources that explain the machine language. Working through that, we find that 8B is the first byte of a MOV instruction that moves from a register or memory location into a register; to find out what the operands are, we look at the following bytes. The second byte, 4C, indicates that the register we are moving into is ecx, and that the source of the move is determined by the next two bytes. Those bytes are 24 and 04, meaning that we add the contents of esp and 4 to find the source address.
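
The byte-by-byte reading above can be sketched in a few lines of Python. Intel calls the second byte of this instruction the ModR/M byte and the third the SIB byte; each packs a 2-bit field and two 3-bit fields. This is only an illustration for this one instruction shape (opcode 8B, an 8-bit displacement, and a SIB byte with no index register). The names split_fields and decode_mov_8b are hypothetical, and a real disassembler has to handle many more cases.

```python
# 32-bit register names, indexed by their 3-bit encoding.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_fields(byte):
    """Split a ModR/M or SIB byte into its 2-bit, 3-bit and 3-bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def decode_mov_8b(code):
    """Decode the one encoding shape used by 8B 4C 24 04 (illustrative only)."""
    opcode, modrm, sib, disp8 = code
    if opcode != 0x8B:
        raise ValueError("not the MOV r32, r/m32 opcode")
    mod, reg, rm = split_fields(modrm)      # reg selects the destination register
    scale, index, base = split_fields(sib)  # base selects the address register
    # mod == 01: an 8-bit displacement follows; rm == 100: a SIB byte follows;
    # index == 100: no index register, so the scale field is unused.
    if (mod, rm, index) != (0b01, 0b100, 0b100):
        raise NotImplementedError("this sketch handles only one addressing form")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # the displacement is signed
    return f"mov {REG32[reg]}, [{REG32[base]}{disp:+d}]"


print(decode_mov_8b(bytes.fromhex("8b4c2404")))  # mov ecx, [esp+4]
```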

Assembly Language

Not many people can tell what 8B 4C 24 04 "does" without a lot of effort, but most people with a little familiarity with the processor architecture would understand:

mov     ecx, [esp+4]

This human-friendly recoding of the machine language is called assembly language.

Going from machine language to assembly language is called "disassembling". Here is the machine language from the example above, together with the disassembled code:

 0:   8b 4c 24 04             mov    ecx, [esp+4]
 4:   8b c1                   mov    eax, ecx
 6:   99                      cdq
 7:   33 c2                   xor    eax, edx
 9:   2b c2                   sub    eax, edx
 b:   83 e0 01                and    eax, 1
 e:   33 c2                   xor    eax, edx
10:   2b c2                   sub    eax, edx
12:   8d 44 49 01             lea    eax, [ecx+ecx*2+1]
16:   74 07                   je     01fh
18:   8d 04 8d fd ff ff ff    lea    eax, [ecx*4-3]
1f:   c3                      ret
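
To see why this listing computes 3n+1 for even n and 4n-3 for odd n, it can help to mirror it one instruction at a time in a high-level language. The Python sketch below is an illustration, not a processor emulator: the function name f and the helper to_signed are made up here, registers are modeled as unsigned 32-bit integers, and only the zero flag is tracked. The first seven instructions leave the signed remainder of n divided by 2 in eax; lea does not change the flags, so je still tests the result of the last sub.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def f(n):
    ecx = n & MASK                         # mov ecx, [esp+4]
    eax = ecx                              # mov eax, ecx
    edx = MASK if eax & 0x80000000 else 0  # cdq: fill edx with the sign bit of eax
    eax ^= edx                             # xor eax, edx
    eax = (eax - edx) & MASK               # sub eax, edx   (eax is now |n|)
    eax &= 1                               # and eax, 1
    eax ^= edx                             # xor eax, edx
    eax = (eax - edx) & MASK               # sub eax, edx   (eax is now n rem 2)
    zero_flag = eax == 0                   # the sub sets ZF when n is even
    eax = (ecx + ecx * 2 + 1) & MASK       # lea eax, [ecx+ecx*2+1]
    if not zero_flag:                      # je 01fh skips the next lea when ZF is set
        eax = (ecx * 4 - 3) & MASK         # lea eax, [ecx*4-3]
    return to_signed(eax)                  # ret


print(f(4), f(5), f(-2), f(-3))  # 13 17 -5 -15
```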

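Note that there are no set rules for what an assembly language should look like, so the same machine instructions can be written in more than one syntax. The disassembly above uses NASM syntax. Here is the same program in GAS syntax, which, among other differences, writes the source operand before the destination: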

 0:   8b 4c 24 04             movl   0x4(%esp,1),%ecx
 4:   8b c1                   movl   %ecx,%eax
 6:   99                      cltd
 7:   33 c2                   xorl   %edx,%eax
 9:   2b c2                   subl   %edx,%eax
 b:   83 e0 01                andl   $0x1,%eax
 e:   33 c2                   xorl   %edx,%eax
10:   2b c2                   subl   %edx,%eax
12:   8d 44 49 01             leal   0x1(%ecx,%ecx,2),%eax
16:   74 07                   je     0x1f
18:   8d 04 8d fd ff ff ff    leal   0xfffffffd(,%ecx,4),%eax
1f:   c3                      ret

NASM and GAS are programs called assemblers. They translate assembly language into machine language. After all, if we want our program to run, we have to get the machine code (i.e., bytes such as 8B 4C 24 04 ...) into memory. We can't expect people to write those bytes directly, so we write in assembly language and let the assembler do the rest. This process is called assembly, and if you want to do it by hand, see Appendix B of Volume 2 of the Intel Software Developer's Manual. You can also disassemble by hand by looking at an opcode map, as we pointed out above, though disassembler programs do exist. For NASM, you can use NDISASM, and for GAS, you can use objdump or gdb.
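
To try a disassembler on the example, the bytes first have to be in a file. Here is a minimal Python sketch, assuming that a flat binary file with no headers is what you want. The function name write_raw_code and the file name code.bin are arbitrary choices for this illustration.

```python
from pathlib import Path

# The hex listing of the example, copied as text.
HEX_LISTING = """
8b 4c 24 04 8b c1 99 33 c2 2b c2 83 e0 01 33 c2
2b c2 8d 44 49 01 74 07 8d 04 8d fd ff ff ff c3
"""


def write_raw_code(path):
    """Write the example's machine code to `path`; return the number of bytes."""
    # Strip all whitespace so that only the hex digits remain.
    code = bytes.fromhex("".join(HEX_LISTING.split()))
    Path(path).write_bytes(code)
    return len(code)


# Example use: write_raw_code("code.bin") creates a 32-byte file.
```

The resulting file can then be given to a disassembler, for instance with ndisasm -b 32 code.bin, or with objdump -D -b binary -m i386 code.bin, which prints GAS syntax by default.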

Assembly language programs need more than just processor instructions (such as add and mov). They also need directives, which tell the assembler such things as which symbols and labels to import and export so that the code can use code written in other files. Other directives tell the assembler that some bytes are to be treated as data, not code, and are eventually to be stored in read-only segments when the program runs.

Why Study This Stuff?

Even if you never program in assembly language, and even though modern compilers often produce better code than assembly language programmers, you should learn machine and assembly language because (see Bryant and O'Hallaron, page 154):

Reading the assembly language output of a compiler gives you insight into the compiler's capabilities.
Reading the assembly language output of a compiler lets you detect where and why a program is inefficient.
Sometimes it is helpful to know where the compiler allocated your data, and how it mapped your threads to system threads.
Many attacks on computer systems exploit knowledge of the machine-level representations of programs.