Compiler Architecture

Compilers are programs, and generally very large programs. They almost always have a structure based on the analysis-synthesis model of translation.

Overview

Compilers perform translation. Every non-trivial translation requires analysis and synthesis.

analsyn.png

Both analysis and synthesis are made up of internal phases.

Compiler Components

These are the main functional components of a production compiler that generates assembly or machine language (if you are just targeting a high-level language like C, or a virtual machine, you might not need so many phases):

compilerphases.png

You might also identify an error-recovery subsystem and a symbol-table manager.

Lexical Analysis (Scanning)

The scanner converts the source program's stream of characters into a stream of tokens, discarding whitespace and comments and expanding macros along the way.

An example in C

#define ZERO 0
    unsigned  gcd(   unsigned   int  // Euclid's algorithm
      x,unsigned   y) {   while ( /* hello */  x>   ZERO
   ){unsigned temp=x;x=y   %x;y  = temp ;}return y ;}

gets tokenized into

unsigned ID(gcd) ( unsigned int ID(x) , unsigned ID(y) ) { while ( ID(x) > INTLIT(0) ) { unsigned ID(temp) = ID(x) ; ID(x) = ID(y) % ID(x) ; ID(y) = ID(temp) ; } return ID(y) ; }
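In Python, the characters-to-tokens step can be sketched with a regular-expression-driven scanner. Everything here (the token categories, the keyword set, the punctuation list) is an assumption for illustration; real scanners also expand macros (the #define above) and report characters matching no pattern as lexical errors, both omitted in this sketch.

```python
import re

# Token categories for a toy C-like fragment (illustrative, not real C).
TOKEN_SPEC = [
    ("WS",      r"\s+"),                 # whitespace: discarded
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),  # comments: discarded
    ("INTLIT",  r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),        # keywords are filtered out below
    ("PUNCT",   r"[{}()%>=;,]"),
]
KEYWORDS = {"unsigned", "int", "while", "return"}
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.S)

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):    # unmatched chars would be lexical errors
        kind, text = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue                     # the scanner throws these away
        if kind == "ID" and text in KEYWORDS:
            tokens.append(text)          # a keyword is its own token
        elif kind in ("ID", "INTLIT"):
            tokens.append(f"{kind}({text})")
        else:
            tokens.append(text)
    return tokens

print(tokenize("while ( /* hello */  x>   0 ) { x = y % x ; }"))
```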

Scanners are concerned with issues such as the source character set, case sensitivity, whether token matching follows the maximal-munch ("longest match") rule, whether keywords are reserved words, and how much lookahead is required.

Errors that might occur during scanning, called lexical errors, include things like the use of characters outside the language's alphabet, unterminated comments or string literals, and malformed numeric literals.

Syntax Analysis (Parsing)

The parser turns the token sequence into an abstract syntax tree. For the example above, we get this tree:

gcdast1.png

The tree can also be stored as a string:

(fundecl unsigned gcd
  (params (param unsigned x) (param unsigned y))
  (block
    (while
      (> x 0)
      (block (vardecl unsigned temp y) (= x (% y x)) (= y temp)))
    (return y)))

Alternatively, each node in the tree can be stored as an object with named fields, many of whose values are themselves nodes in the tree. Note that at this stage in the compilation, the tree is definitely just a tree. There are no cycles.

gcdast2.png
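One way to realize the nodes-as-objects view is with a small class per node kind; the class and field names below are illustrative assumptions, not any particular compiler's API. The to_sexp function renders a node in the parenthesized string form shown above.

```python
from dataclasses import dataclass

# Hypothetical AST node classes for a C-like language (names assumed).
@dataclass
class Id:
    name: str

@dataclass
class IntLit:
    value: int

@dataclass
class Binary:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Id
    source: object

def to_sexp(node):
    """Render a node in the parenthesized string form shown above."""
    if isinstance(node, Id):
        return node.name
    if isinstance(node, IntLit):
        return str(node.value)
    if isinstance(node, Binary):
        return f"({node.op} {to_sexp(node.left)} {to_sexp(node.right)})"
    if isinstance(node, Assign):
        return f"(= {to_sexp(node.target)} {to_sexp(node.source)})"
    raise TypeError(node)

# The loop condition and one assignment from the gcd example:
cond = Binary(">", Id("x"), IntLit(0))
update = Assign(Id("x"), Binary("%", Id("y"), Id("x")))
print(to_sexp(cond))    # (> x 0)
print(to_sexp(update))  # (= x (% y x))
```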

When constructing a parser, one needs to be concerned with grammar complexity (such as whether the grammar is LL or LR), and whether there are any disambiguation rules that might have to be hacked in. Some languages actually require a bit of semantic analysis to parse.

Exercise: Show that the expression (x)-y in C can have two different syntactic interpretations. Hint: Your answer will probably contain the words "subtraction", "typedef", "cast", and "negation".

Errors that might occur during parsing, called syntax errors, include things like the following in C: an operator with a missing operand (x = 3 + ;), unbalanced braces or parentheses, or a keyword where it cannot appear (return while;).

Semantic Analysis

During semantic analysis we have to check legality rules, and while doing so, we tie up the pieces of the syntax tree (by resolving identifier references, inserting cast operations for implicit coercions, etc.) to form a semantic graph.

Continuing the example above:

gcdsemgraph.png
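One central piece of semantic analysis, resolving identifier references through a chain of scopes, can be sketched as follows. The class names and error messages are illustrative assumptions; the point is that after resolution, a use of x in the loop body and the declaration of x in the parameter list refer to the very same object, which is why the tree becomes a graph.

```python
# A sketch of identifier resolution with a scope-chained symbol table.
# Class names and rules are assumptions, not a real compiler's API.

class Variable:
    def __init__(self, name, type):
        self.name, self.type = name, type

class SymbolTable:
    def __init__(self, parent=None):
        self.bindings, self.parent = {}, parent

    def declare(self, variable):
        if variable.name in self.bindings:          # a legality rule
            raise Exception(f"redeclaration of {variable.name}")
        self.bindings[variable.name] = variable

    def lookup(self, name):
        scope = self
        while scope is not None:                    # walk outward through scopes
            if name in scope.bindings:
                return scope.bindings[name]
            scope = scope.parent
        raise Exception(f"{name} not declared")     # a static semantic error

params = SymbolTable()
params.declare(Variable("x", "unsigned"))
params.declare(Variable("y", "unsigned"))
body = SymbolTable(parent=params)                   # scope of the while body
body.declare(Variable("temp", "unsigned"))

# The use of x in the body resolves, through the scope chain, to the
# same object as the parameter declaration: two nodes now share a child.
assert body.lookup("x") is params.lookup("x")
```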

Obviously, the set of legality rules is different for each language. Examples of legality rules you might see in a Java-like language include:

Exercise: Give examples of each of the above.

Errors occurring during this phase are called static semantic errors.

Exercise: The language Pascal has an unusual syntax when it comes to expressions: it gives the "and" operator (which requires boolean operands) higher precedence than relational operators! Show how this implies that the expression x-4<=5 and 2<y is a static semantic error.

Intermediate Code Generation

The intermediate code generator produces a flow graph made up of tuples grouped into basic blocks. For the example above, we'd see:

gcdflowgraph.png
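To make the tuples-in-basic-blocks idea concrete, here is one plausible (entirely assumed) encoding of the gcd flow graph, together with a tiny interpreter that executes it. The opcode names and block labels are inventions for illustration, not a real IR.

```python
# Tuples grouped into basic blocks; each block ends with a jump or return.
blocks = {
    "B1": [("JUMP", "B2")],
    "B2": [("GT", "x", 0, "t1"),             # t1 := x > 0
           ("JUMP_IF", "t1", "B3", "B4")],   # loop test
    "B3": [("COPY", "x", "temp"),            # temp := x
           ("MOD", "y", "x", "t2"),          # t2 := y % x
           ("COPY", "t2", "x"),
           ("COPY", "temp", "y"),
           ("JUMP", "B2")],
    "B4": [("RET", "y")],
}

def value(env, a):
    """An operand is either a variable name or an integer literal."""
    return env[a] if isinstance(a, str) else a

def run(blocks, x, y):
    env = {"x": x, "y": y}
    label = "B1"
    while True:
        for op, *args in blocks[label]:
            if op == "COPY":
                env[args[1]] = value(env, args[0])
            elif op == "GT":
                env[args[2]] = value(env, args[0]) > value(env, args[1])
            elif op == "MOD":
                env[args[2]] = value(env, args[0]) % value(env, args[1])
            elif op == "JUMP":
                label = args[0]; break
            elif op == "JUMP_IF":
                label = args[1] if env[args[0]] else args[2]; break
            elif op == "RET":
                return env[args[0]]

print(run(blocks, 14, 21))  # 7
```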

You can read more about intermediate representations elsewhere.

Machine Independent Code Improvement

Code improvement that is done on the semantic graph or on the intermediate code is called machine independent code optimization. In practice there are zillions of known optimizations (er, improvements), but none really apply to our running example.
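As a taste of one such improvement, here is a sketch of constant folding over (operator, left, right) expression tuples. The representation is an assumed toy, not a real compiler's IR.

```python
import operator

# Map operator symbols to their compile-time evaluations.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(e):
    """Replace any operator node with two constant operands by its value."""
    if not isinstance(e, tuple):
        return e                       # a leaf: a constant or a variable name
    op, l, r = e[0], fold(e[1]), fold(e[2])
    if isinstance(l, int) and isinstance(r, int):
        return OPS[op](l, r)           # evaluate now instead of at run time
    return (op, l, r)

print(fold(("+", ("*", 60, 60), "x")))  # ('+', 3600, 'x')
```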

Code Generation

Code generation produces the actual target code, or something close. In traditional compilers targeting a RISC machine, the first code generation phase produces assembly language that assumes an infinite set of virtual registers. For CISC machines this probably won't be the case, but you never know. Here is code for our running example generated by gcc 4.2.1 for the x86 on Mac OS X 10.6:

        .text
        .globl _gcd
_gcd:
LFB2:
        pushq	%rbp
LCFI0:
        movq	%rsp, %rbp
LCFI1:
        movl	%edi, -20(%rbp)
        movl	%esi, -24(%rbp)
        jmp     L2
L3:
        movl	-20(%rbp), %eax
        movl	%eax, -4(%rbp)
        movl	-24(%rbp), %eax
        movl	$0, %edx
        divl	-20(%rbp)
        movl	%edx, -20(%rbp)
        movl	-4(%rbp), %eax
        movl	%eax, -24(%rbp)
L2:
        cmpl	$0, -20(%rbp)
        jne     L3
        movl	-24(%rbp), %eax
        leave
        ret
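The step from tuples to instructions can be sketched as a naive instruction selector that maps each tuple to a fixed instruction pattern. The tuple shapes are toy assumptions, and real selectors match larger patterns and then run a register allocator; the one x86 fact relied on here is that divl leaves the remainder in %edx.

```python
# Naive instruction selection: one fixed x86-flavored pattern per tuple.
def select(tuples):
    asm = []
    for op, *args in tuples:
        if op == "COPY":                        # dst := src
            src, dst = args
            asm.append(f"movl {src}, {dst}")
        elif op == "MOD":                       # dst := a mod b
            a, b, dst = args
            asm.append(f"movl {a}, %eax")       # dividend
            asm.append("movl $0, %edx")         # clear high half
            asm.append(f"divl {b}")             # remainder lands in %edx
            asm.append(f"movl %edx, {dst}")
        elif op == "RET":                       # return value goes in %eax
            asm.append(f"movl {args[0]}, %eax")
            asm.append("ret")
    return asm

for line in select([("MOD", "-24(%rbp)", "-20(%rbp)", "-20(%rbp)")]):
    print(f"        {line}")
```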

Machine Dependent Code Improvement

Usually the final phase in compilation is cleaning up and improving the target code. For the example above, I got this when setting the optimization level to -O6:

        .text
        .align	4,0x90
        .globl	_gcd
_gcd:
LFB2:
        pushq	%rbp
LCFI0:
        movq	%rsp, %rbp
LCFI1:
        movl	%esi, %eax
        testl	%edi, %edi
        jne	L7
        jmp	L2
        .align	4,0x90
L9:
        movl	%edx, %edi
L7:
        xorl	%edx, %edx
        divl	%edi
        movl	%edi, %eax
        testl	%edx, %edx
        jne	L9
L2:
        leave
        ret
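Improvements at this level are often peephole optimizations: pattern matching over a short window of instructions. Here is a sketch with two classic patterns; the patterns and the string-based instruction representation are illustrative assumptions.

```python
# A peephole pass over a list of instruction strings.
def peephole(instructions):
    out = []
    i = 0
    while i < len(instructions):
        ins = instructions[i]
        # Pattern 1: a move from a register to itself does nothing.
        if ins.startswith("movl") and "," in ins:
            src, dst = (s.strip() for s in ins[4:].split(",", 1))
            if src == dst:
                i += 1
                continue
        # Pattern 2: a jump to the label on the very next line is redundant.
        if (ins.startswith("jmp") and i + 1 < len(instructions)
                and instructions[i + 1] == ins.split()[1] + ":"):
            i += 1
            continue
        out.append(ins)
        i += 1
    return out

code = ["movl %eax, %eax", "jmp L2", "L2:", "leave", "ret"]
print(peephole(code))  # ['L2:', 'leave', 'ret']
```

Real peephole optimizers iterate until no pattern applies, since removing one instruction can expose another match.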