Intermediate Representations

You can't really write a translator, or even an iterpreter, without using an intermediate language.

What is an IR?

An intermediate representation is a representation of a program part way between the source and target languages. A good IR is one that is fairly independent of the source and target languages, so that it maximizes its ability to be used in a retargetable compiler.

Why use an IR?

Lots of reasons:

Styles

Intermediate representations are usually:

We'll use the same example code fragment to illustrate all the styles.

while (x < 4 * y) {
    x = y / 3 >> x;
    if (y) print x - 3;
}

Semantic Graph

irgraphexample.gif

Tuples

Here's the example with tuples. Tuples are stored as on the left column, but usually rendered as on the right.

(JUMP, L2)                          goto L2
(LABEL, L1)                    L1:
(SHR, 3, x, t0)                     t0 := 3 >> x
(DIV, y, t0, t1)                    t1 := y / t0
(COPY, t1, x)                       x := t1
(JZ, y, L3)                         if y == 0 goto L3
(SUB, x, 3, t2)                     t2 := x - 3
(PRINT, t2)                         print t2
(LABEL, L3)                    L3:
(LABEL, L2)                    L2:
(MUL, 4, y, t4)                     t4 := 4 * y
(LT, x, t4, t5)                     x := t4 < t5
(JNZ, t5, L1)                       if t5 != 0 goto L1

Stack Code

    goto L2
L1:
    load y
    load_constant 3
    load x
    shr
    div
    store x
    load y
    jump_if_zero L3
    load x
    load_constant 3
    sub
    print
L3:
L2:
    load x
    load_constant 4
    load y
    mul
    less_than
    jump_if_not_zero L1

Levels

Genenerally we recognize three levels

Here is a running C++ example to illustrate all three

double a[20][10];
.
.
.
for (int i = 0; i < n; i += di)
    a[i][j+2] = j;

High-Level

Example

(COPY, 0, i)                        i := 0
(LABEL, L1)                    L1:
(JGE, i, n, L2)                     if i >= n goto L2
(INDEX, a, i, t0)                   t0 := a[i]
(ADD, j, 2, t1)                     t1 := j + 2
(INDEX, t0, t1, t2)                 t2 := t0[t1]
(COPY_TO_DEREF, j, t2)              *t2 := j
(INCJUMP, i, di, L1)                i += di, goto L1
(LABEL, L2)                    L2:

Medium-Level

Example

When constructing a medium level IR, the IR generator must be told about the sizes of source language datatypes. Assume here that double require 8 bytes and ints require 4 bytes.

(COPY, 0, i)                        i := 0
(LABEL, L1)                    L1:
(JGE, i, n, L2)                     if i >= n goto L2
(MUL, i, 80, t0)                    t0 := i * 80
(ADD, a, t0, t1)                    t1 := a + t0
(ADD, j, 2, t2)                     t2 := j + 2
(MUL, t2, 8, t3)                    t3 := t2 * 8
(ADD, t1, t3, t4)                   t4 := t1 + t3
(COPY_TO_DEREF, j, t4)              *t4 := j
(ADD, i, di, i)                     i := i + di
(JUMP, L1)                          goto L1
(LABEL, L2)                    L2:

Note: some languages will require bounds checking on arrays and a different treatment of for-loops.

Low-Level

Example

Suppose in our example we know the target is a RISC processor with no memory operations. We would notice, for this block that needn't be stored in memory, we'll just use r0. j can go in r1 and n in r2 and di in r3. They're not changed in this loop, so we needn't save them at the end. We might get something like

(LDC, 0, r0)                        r0 := 0
(LOAD, j, r1)                       r1 := j
(LOAD, n, r2)                       r2 := n
(LOAD, di, r3)                      r3 := di
(LOAD, a, r4)                       r4 := a
(LABEL, L1)                    L1:
(JGE, r0, r2, L2)                   if r0 >= r2 goto L2
(MUL, r0, 80, r5)                   r5 := r0 * 80
(ADD, r4, r5, r6)                   r6 := r4 + r5
(ADD, r1, 2, r7)                    r7 := r1 + 2
(MUL, r7, 8, r8)                    r8 := r7 * 8
(ADD, r6, r8, r9)                   r9 := r6 + r8
(TOFLOAT, r1, f0)                   f0 := tofloat r1
(STOREIND, f0, r9)                  *r9 := f0
(ADD, r0, r3, r0)                   r0 := r0 + r3
(JUMP, L1)                          goto L1
(LABEL, L2)                    L2:

Structures

Some structures generally associated with IRs include

Control Flow Graphs

Definition

A control flow graph is a graph whose nodes are basic blocks and whose edges are transitions between blocks.

Basic Blocks

A basic block is a

... in the abscence of hardware faults, interrupts, crashes, threading problems, etc.

To locate basic blocks in flattened code:

Example Control Flow Graph

     goto L2
L1:
     t0 := 3 >> x
     t1 := y / t0
     x := t1
     if y == 0 goto L3
     t2 := x - 3
     print t2
L3:
L2:
     t4 := 4 * y
     x := t4 < t5
     if t5 != 0 goto L1

controlgraphexample.gif

Example IRs

GNU RTL
The intermediate language for the many source and target languages of the GNU Compiler Collection.
Diana
Descriptive Intermediate Attributed Notation for Ada. No longer used by major Ada compilers.
PCODE
The intermediate language of early Pascal compilers. Stack based. Responsible for wide adoption of Pascal in the 1970s.
Java Virtual Machine
Another virtual machine specification. Almost all Java compilers and some Ada compilers use this format. JVM code can be interpreted, run on specialized hardware, or jitted.
Squid
One I made up myself.
WIL
An intermediate language designed for object-oriented source languages.
CIL
Common Intermediate Langauge. Langauges in Microsoft's .NET framework (such as C+, VB.NET, etc.) compile to CIL, which is then assembled into bytecode.
C
Why not? It's widely available and the whole back end is already done within the C compiler.
C--
Kind of like using C, but C-- is designed explicitly to be an intermediate language, and even includes a run-time interface to make it easier to do garbage collection and exception handling.

Case Study: An IR for C

C is such a simple language, an IR is fairly easy to design. Why is C so "simple"?

Thus, the following tuples are sufficient:

    x ← y                        x ← y[i]
    x ← &y                       x[i] ← y
    x ← *y                       goto L
    *x ← y                       if x relop y goto L
    x ← unaryop y                param x
    x ← y binaryop z             call p, n

    unaryop is one of: +, -, !, ~, ...
    binaryop is one of: +, -, *, /, %, &, |, ^, ., &&, ||, ...
    relop is one of ==, !=, <, <=, >, >=
    x[i] means i memory location x + i
    call p,n means call p with n bytes of arguments