Here is the machine language for a chunk of code (for an IA-32 processor) that takes (from the stack) a single 32-bit integer argument — let's call it n — and returns through eax the value 3n+1 if n is even and 4n-3 if n is odd.
1000101101001100001001000000010010001011110000011001100100110011 1100001000101011110000101000001111100000000000010011001111000010 0010101111000010100011010100010001001001000000010111010000000111 1000110100000100100011011111110111111111111111111111111111000011
Binary is too hard to read. Let's use hex:
8b 4c 24 04 8b c1 99 33 c2 2b c2 83 e0 01 33 c2 2b c2 8d 44 49 01 74 07 8d 04 8d fd ff ff ff c3
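To see that binary and hex are just two spellings of the same bytes, here is a small Python sketch that converts in both directions. The function names are made up for this illustration; the only data is the hex listing above.

```python
# The 32 bytes of the example, copied from the hex listing.
HEX = ("8b 4c 24 04 8b c1 99 33 c2 2b c2 83 e0 01 33 c2 "
       "2b c2 8d 44 49 01 74 07 8d 04 8d fd ff ff ff c3")


def hex_to_bits(hex_text: str) -> str:
    """Write every byte as eight binary digits."""
    code = bytes.fromhex(hex_text)  # fromhex ignores the spaces between bytes
    return "".join(f"{byte:08b}" for byte in code)


def bits_to_hex(bits: str) -> str:
    """Group a bit string into bytes and write each byte as two hex digits."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string must contain a whole number of bytes")
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return " ".join(f"{int(octet, 2):02x}" for octet in octets)


if __name__ == "__main__":
    bits = hex_to_bits(HEX)
    print(len(bits))          # 256 bits, i.e. 32 bytes
    print(bits[:32])          # 10001011010011000010010000000100
    print(bits_to_hex(bits) == HEX)
```

Each hex digit stands for exactly four bits, which is why the hex form is a quarter of the length and much easier to scan.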
How can you tell what it does? You can look at the Intel Software Developer's Manual, Volume 2, Appendix A ("Opcode Map"), or at any of several online sources that explain the machine language. Working through the opcode map shows that 8B is the first byte of a MOV instruction that moves from a register or memory location into a register; to find out what the operands are, we look at the following bytes. The second byte, 4C, indicates that the register we are moving into is ecx, and that the source of the move is determined by the next two bytes. Those bytes are 24 and 04, meaning that we add 4 to the contents of esp to find the source address.
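The byte-by-byte reading above can be sketched in Python. This is only an illustration under narrow assumptions: it handles the single instruction form used here (opcode 8B, a ModR/M byte asking for a SIB byte and an 8-bit displacement, and no index register), and the names split_fields and decode_mov are invented for the sketch. In Intel's terminology the second byte is the ModR/M byte and the third is the SIB byte; each is split into a 2-bit field and two 3-bit fields.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_fields(byte: int) -> tuple:
    """Split a byte into its 2-bit, 3-bit and 3-bit fields, high bits first."""
    return byte >> 6, (byte >> 3) & 0b111, byte & 0b111


def decode_mov(code: bytes) -> str:
    """Decode 8B /r with mod=01, a SIB byte, no index register, and disp8.

    Anything else is rejected; a real disassembler covers far more cases.
    """
    opcode, modrm, sib, disp8 = code
    mod, reg, rm = split_fields(modrm)       # 4C -> mod=01, reg=001, rm=100
    scale, index, base = split_fields(sib)   # 24 -> scale=00, index=100, base=100
    if opcode != 0x8B or mod != 0b01 or rm != 0b100 or index != 0b100:
        raise ValueError("form not handled by this sketch")
    disp = disp8 - 256 if disp8 >= 128 else disp8   # disp8 is a signed byte
    return f"mov {REG32[reg]}, [{REG32[base]}{disp:+d}]"


if __name__ == "__main__":
    print(decode_mov(bytes.fromhex("8b4c2404")))   # mov ecx, [esp+4]
```

The reg field 001 selects ecx, the base field 100 selects esp, and the last byte supplies the displacement 4.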
Not many people can tell what 8B 4C 24 04 "does" without a lot of effort, but most people with a little familiarity with the processor architecture would understand:
mov ecx, [esp+4]
This human-friendly recoding of the machine language is called assembly language.
Going from machine language to assembly language is called "disassembling". Here is the machine language from our example above, together with the disassembled code:
     0:  8b 4c 24 04             mov   ecx, [esp+4]
     4:  8b c1                   mov   eax, ecx
     6:  99                      cdq
     7:  33 c2                   xor   eax, edx
     9:  2b c2                   sub   eax, edx
     b:  83 e0 01                and   eax, 1
     e:  33 c2                   xor   eax, edx
    10:  2b c2                   sub   eax, edx
    12:  8d 44 49 01             lea   eax, [ecx+ecx*2+1]
    16:  74 07                   je    01fh
    18:  8d 04 8d fd ff ff ff    lea   eax, [ecx*4-3]
    1f:  c3                      ret
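To convince yourself that this listing really computes 3n+1 for even n and 4n-3 for odd n, you can follow it one instruction at a time. Below is a rough Python sketch that models the three registers as 32-bit values and the zero flag as a boolean; the function names are invented, and only the effects needed by this listing are modeled.

```python
MASK = 0xFFFFFFFF  # registers hold 32 bits


def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value


def run(n: int) -> int:
    """Follow the listing instruction by instruction; return eax as a signed value."""
    ecx = n & MASK                           # mov ecx, [esp+4]
    eax = ecx                                # mov eax, ecx
    edx = MASK if eax & 0x80000000 else 0    # cdq: fill edx with the sign bit of eax
    eax ^= edx                               # xor eax, edx
    eax = (eax - edx) & MASK                 # sub eax, edx   (eax is now |n|)
    eax &= 1                                 # and eax, 1     (low bit of |n|)
    eax ^= edx                               # xor eax, edx
    eax = (eax - edx) & MASK                 # sub eax, edx   (restores the sign)
    zero_flag = eax == 0                     # the sub sets ZF when its result is 0
    eax = (ecx + ecx * 2 + 1) & MASK         # lea eax, [ecx+ecx*2+1] (flags untouched)
    if not zero_flag:                        # je 01fh: jump to ret when ZF is set
        eax = (ecx * 4 - 3) & MASK           # lea eax, [ecx*4-3]
    return to_signed(eax)                    # ret


if __name__ == "__main__":
    for n in (4, 5, -3, -4):
        print(n, run(n))
```

The cdq/xor/sub sequence is a branch-free way to compute the remainder of n divided by 2 with the sign of n, and because lea does not change the flags, the je still sees the result of the last sub.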
Note that there are no set rules for what an assembly language should look like. The version we saw above uses NASM syntax; here is the same program using GAS syntax:
     0:  8b 4c 24 04             movl  0x4(%esp,1),%ecx
     4:  8b c1                   movl  %ecx,%eax
     6:  99                      cltd
     7:  33 c2                   xorl  %edx,%eax
     9:  2b c2                   subl  %edx,%eax
     b:  83 e0 01                andl  $0x1,%eax
     e:  33 c2                   xorl  %edx,%eax
    10:  2b c2                   subl  %edx,%eax
    12:  8d 44 49 01             leal  0x1(%ecx,%ecx,2),%eax
    16:  74 07                   je    0x1f
    18:  8d 04 8d fd ff ff ff    leal  0xfffffffd(,%ecx,4),%eax
    1f:  c3                      ret
NASM and GAS are programs called assemblers. They translate assembly language into machine language. After all, if we want our program to run, we have to get the machine code (i.e., bytes such as 8B 4C 24 04 ...) into memory. We can't expect people to do this directly, so we write in assembly language and let the assembler do the rest. This process is called assembly, and if you want to do it by hand, see Appendix B of Volume 2 of the Intel Software Developer's Manual. You can disassemble by hand by looking at an opcode map, as we pointed out above, though disassembler programs do exist. For NASM, you can use NDISASM, and for GAS, you can use objdump or gdb.
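To get a feel for what an assembler does, here is a toy sketch in Python that encodes three of the register-to-register instructions from the example. It is nowhere near a real assembler: the function assemble and its tables are invented for this illustration, it accepts only the form "op reg, reg" in NASM operand order, and it always uses the "op r32, r/m32" opcodes that appear in the listing. x86 has a second encoding for such instructions (for example, opcode 89 for mov), so a real assembler may emit different bytes for the same source line.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# Opcodes of the "op r32, r/m32" forms used in the example listing.
OPCODES = {"mov": 0x8B, "xor": 0x33, "sub": 0x2B}


def assemble(line: str) -> bytes:
    """Encode 'op dst, src' where both operands are 32-bit registers."""
    mnemonic, operands = line.split(None, 1)
    dst, src = (name.strip() for name in operands.split(","))
    # ModR/M byte: mod=11 means the r/m field names a register, not memory.
    modrm = 0b11000000 | (REG32[dst] << 3) | REG32[src]
    return bytes([OPCODES[mnemonic], modrm])


if __name__ == "__main__":
    for source in ("mov eax, ecx", "xor eax, edx", "sub eax, edx"):
        print(f"{assemble(source).hex(' '):8} {source}")
```

The three lines come out as 8b c1, 33 c2 and 2b c2, matching the corresponding bytes in the hex dump.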
Assembly language programs need more than just processor instructions (such as add, mov, and so on). They need directives to tell the assembler such things as which symbols and labels to import and export so the code can use code written in other files. Other directives are required to tell the assembler that some bytes need to be treated as data, not code, and are to eventually be stored in read-only segments when running.
Even if you never program in assembly language, and even though modern compilers often produce better code than assembly language programmers, you should learn machine and assembly language because (see Bryant and O'Hallaron, page 154):