Introduction to Code Generation

What about the back half of the compiler?

What is Code Generation?

The first part of a compiler analyzes the source code into a structure that carries the meaning of the program; this structure is generally the abstract syntax tree that’s been checked and decorated. (Remember decorated means all identifier references have been resolved.)

From this structure we can generate the corresponding code in some other language, the target language. This is what a code generator does.

2-part-compiler.png

Some compilers generate twice: they first generate code in some “intermediate language” like SIL, LLVM IR, HIR, MIR, CIL, etc. Then they do the “real” code generation into a target language that is directly runnable (or really close to it), like virtual machine code, assembly language, or machine language.

3-part-compiler.png

Goals of Code Generation

Here are the three goals of any good code generator:

Correctness

It should produce correct code

Efficiency

It should produce efficient code

Self-Efficiency

It should itself be efficient

Exercise: The last two goals are seemingly in conflict. Discuss.

What Kind of Code Can We Generate?

There is no one way to do code generation. There is no one target of a code generator. A code generator can produce:

Are there other kinds of code?

Two More Questions

Also consider:

Transpiling

JavaScript makes a great target language? Why? It’s got first-class functions and async support, and that Node.js thing runs on that awesome V8. And the ecosystem! How can we target JS?

Start by browsing an existing example. This one, for the language Carlos, translates the decorated AST (Semantic Graph) directly into JavaScript (there’s no need to use an IR if you are targeting a high-level language):

generator.js
// CODE GENERATOR: Carlos -> JavaScript
//
// Invoke generate(program) with the program node to get back the JavaScript
// translation as a string.

import { IfStatement, Type, StructType } from "./core.js"
import * as stdlib from "./stdlib.js"

export default function generate(program) {
  const output = []

  const standardFunctions = new Map([
    [stdlib.contents.print, x => `console.log(${x})`],
    [stdlib.contents.sin, x => `Math.sin(${x})`],
    [stdlib.contents.cos, x => `Math.cos(${x})`],
    [stdlib.contents.exp, x => `Math.exp(${x})`],
    [stdlib.contents.ln, x => `Math.log(${x})`],
    [stdlib.contents.hypot, ([x, y]) => `Math.hypot(${x},${y})`],
    [stdlib.contents.bytes, s => `[...Buffer.from(${s}, "utf8")]`],
    [stdlib.contents.codepoints, s => `[...(${s})].map(s=>s.codePointAt(0))`],
  ])

  // Variable and function names in JS will be suffixed with _1, _2, _3,
  // etc. This is because "switch", for example, is a legal name in Carlos,
  // but not in JS. So, the Carlos variable "switch" must become something
  // like "switch_1". We handle this by mapping each name to its suffix.
  const targetName = (mapping => {
    return entity => {
      if (!mapping.has(entity)) {
        mapping.set(entity, mapping.size + 1)
      }
      return `${entity.name ?? entity.description}_${mapping.get(entity)}`
    }
  })(new Map())

  function gen(node) {
    return generators[node.constructor.name](node)
  }

  const generators = {
    // Key idea: when generating an expression, just return the JS string; when
    // generating a statement, write lines of translated JS to the output array.
    Program(p) {
      gen(p.statements)
    },
    VariableDeclaration(d) {
      // We don't care about const vs. let in the generated code! The analyzer has
      // already checked that we never updated a const, so let is always fine.
      output.push(`let ${gen(d.variable)} = ${gen(d.initializer)};`)
    },
    TypeDeclaration(d) {
      // The only type declaration in Carlos is the struct! Becomes a JS class.
      output.push(`class ${gen(d.type)} {`)
      output.push(`constructor(${gen(d.type.fields).join(",")}) {`)
      for (let field of d.type.fields) {
        output.push(`this[${JSON.stringify(gen(field))}] = ${gen(field)};`)
      }
      output.push("}")
      output.push("}")
    },
    StructType(t) {
      return targetName(t)
    },
    Field(f) {
      return targetName(f)
    },
    FunctionDeclaration(d) {
      output.push(`function ${gen(d.fun)}(${gen(d.fun.parameters).join(", ")}) {`)
      gen(d.body)
      output.push("}")
    },
    Parameter(p) {
      return targetName(p)
    },
    Variable(v) {
      // Standard library constants just get special treatment
      if (v === stdlib.contents.π) {
        return "Math.PI"
      }
      return targetName(v)
    },
    Function(f) {
      return targetName(f)
    },
    Increment(s) {
      output.push(`${gen(s.variable)}++;`)
    },
    Decrement(s) {
      output.push(`${gen(s.variable)}--;`)
    },
    Assignment(s) {
      output.push(`${gen(s.target)} = ${gen(s.source)};`)
    },
    BreakStatement(s) {
      output.push("break;")
    },
    ReturnStatement(s) {
      output.push(`return ${gen(s.expression)};`)
    },
    ShortReturnStatement(s) {
      output.push("return;")
    },
    IfStatement(s) {
      output.push(`if (${gen(s.test)}) {`)
      gen(s.consequent)
      if (s.alternate.constructor === IfStatement) {
        output.push("} else")
        gen(s.alternate)
      } else {
        output.push("} else {")
        gen(s.alternate)
        output.push("}")
      }
    },
    ShortIfStatement(s) {
      output.push(`if (${gen(s.test)}) {`)
      gen(s.consequent)
      output.push("}")
    },
    WhileStatement(s) {
      output.push(`while (${gen(s.test)}) {`)
      gen(s.body)
      output.push("}")
    },
    RepeatStatement(s) {
      // JS can only repeat n times if you give it a counter variable!
      const i = targetName({ name: "i" })
      output.push(`for (let ${i} = 0; ${i} < ${gen(s.count)}; ${i}++) {`)
      gen(s.body)
      output.push("}")
    },
    ForRangeStatement(s) {
      const i = targetName(s.iterator)
      const op = s.op === "..." ? "<=" : "<"
      output.push(`for (let ${i} = ${gen(s.low)}; ${i} ${op} ${gen(s.high)}; ${i}++) {`)
      gen(s.body)
      output.push("}")
    },
    ForStatement(s) {
      output.push(`for (let ${gen(s.iterator)} of ${gen(s.collection)}) {`)
      gen(s.body)
      output.push("}")
    },
    Conditional(e) {
      return `((${gen(e.test)}) ? (${gen(e.consequent)}) : (${gen(e.alternate)}))`
    },
    BinaryExpression(e) {
      const op = { "==": "===", "!=": "!==" }[e.op] ?? e.op
      return `(${gen(e.left)} ${op} ${gen(e.right)})`
    },
    UnaryExpression(e) {
      if (e.op === "some") {
        e.op = ""
      }
      return `${e.op}(${gen(e.operand)})`
    },
    EmptyOptional(e) {
      return "undefined"
    },
    SubscriptExpression(e) {
      return `${gen(e.array)}[${gen(e.index)}]`
    },
    ArrayExpression(e) {
      return `[${gen(e.elements).join(",")}]`
    },
    EmptyArray(e) {
      return "[]"
    },
    MemberExpression(e) {
      const object = gen(e.object)
      const field = JSON.stringify(gen(e.field))
      const chain = e.isOptional ? "?." : ""
      return `(${object}${chain}[${field}])`
    },
    Call(c) {
      const targetCode = standardFunctions.has(c.callee)
        ? standardFunctions.get(c.callee)(gen(c.args))
        : c.callee.constructor === StructType
        ? `new ${gen(c.callee)}(${gen(c.args).join(", ")})`
        : `${gen(c.callee)}(${gen(c.args).join(", ")})`
      // Calls in expressions vs in statements are handled differently
      if (c.callee instanceof Type || c.callee.type.returnType !== Type.VOID) {
        return targetCode
      }
      output.push(`${targetCode};`)
    },
    Number(e) {
      return e
    },
    BigInt(e) {
      return e
    },
    Boolean(e) {
      return e
    },
    String(e) {
      return e
    },
    Array(a) {
      return a.map(gen)
    },
  }

  gen(program)
  return output.join("\n")
}

What did you notice from browsing these examples? A few tips and tricks, right? Things like:

Here is a view of some of the Carlos functions in table form:

Graph FragmentJavaScript
     program
     /     \
    s1  ..  sn
s1';
 .
 .
sn';
     vardecl
     /  |  \
    v   T   E
let v' = E'
 structtypedecl
 /       \
t    structtype
      /      \
    field .. field
     / \      / \
    x1  T1   xn  Tn
class t' {
  constructor (x1', .., xn') {
    this.x1' = x1';
    ...
    this.xn' = xn';
  }
}
     functiondecl
      /       |  \
   function   T   B
  /  |      \
f  param .. param
     / \     / \
    x1  T1  xn  Tn
function f(x1', .., xn') {
  B
}
       inc
        |
        V
V'++
       dec
        |
        V
V'--
      assign
      /    \
     V      E
V' = E'
      break
break
      return
return
      return
        |
        E
return E'
       if
      /  \
     E    B
if (E') { B' }
       if
     /  | \
    E  B1  B2
if (E') { B1' } else { B2'}
// But no braces around B2 if it’s an if-statement
      while
      /   \
     E     B
while (E') { B' }
     repeat
     /    \
    E      B
for (let i = 0; i < E'; i++) { B' }
// (note: i is a new variable)
    for-upto
   /  |  |  \
  x   E1 E2  B
for (let x' = E1'; x' < E2'; x'++) { B' }
   for-through
  /  |   |   \
 x   E1  E2   B
for (let x' = E1'; x' <= E2'; x'++) { B' }
  for-collection
     /  |   \
    x   A    B
for (let x' of A') { B' }
   conditional
    /  |   \
   E   E1   E2
((E') ? (E1') : (E2'))
        +
       / \
     E1   E2
(E1' + E2)
// same for all other binary ops
// translated where necessary (e.g. == → ===)
        -
        |
        E
(- E')
// same for all other unary ops
// except "some", which is a no-op in JS
  emptyoptional
undefined
    subscript
      /   \
     A     E
A'[E']
 arrayexpression
    /      \
   E1 ..... En
[E1', ...., En']
      member
      /    \
     O      F
O'["F'"]
// quotes are important
     fn-call
    /  |   \
   F   E1 .. En
F'(E1', ...., En')
// Standard Library functions handled inline
    ctor-call
    /  |   \
   C   E1 .. En
new C'(E1', ...., En')

Virtual Machines and Backend Language Libraries

LLVM

LLVM is, well, it’s huge, it’s impressive, and lots of compilers for modern languages are using it. It is a library with a ton of stuff, including an intermediate language and backends. It’s official name is The LLVM Compiler Infrastructure Project.

A take on modern compiler construction from Andi McClure:

llvm-humor.png

Exercise: Read the whole blog post from whence came that awesome picture.

JVM

Read about the JVM at Wikipedia and about its formal specification.

Web Assembly

Read about Web Assembly

Exercise: Explain WASM, briefly, in your own words.

Write your own VM!

SERIOUSLY. DO IT.

Targeting a Typical Processor

Okay let’s say you’re going to bypass LLVM or anything else and you are going straight to assembly. Or, let’s say you are joining the LLVM team and want to enhance the project. Or you want to write a new backend for an intermediate representation. Lots of good stuff to know. Here are some notes that just scratch the surface of generating assembly language.

Address Assignment

TODO - Fortran no recursion local vars, even return val on stack

TODO - C, show assembly for recursive function

TODO - C, show space for local vars

TODO - motivate stack frames

TODO - motivate dynamic links

TODO - Three approaches to static scope

TODO - Static link

TODO - Single display

TODO - Display in every frame

TODO - No need for display or static link in leaf subroutine

TODO - Who saves the registers

TODO - Inlining?

TODO - Summary of questions and concerns

Instruction Selection

TODO