Code Generation

What about the back half of the compiler?

What is Code Generation?

The first part of a compiler analyzes the source code into a structure that carries the meaning of the program; this structure is generally the abstract syntax tree that’s been checked and decorated. (Remember decorated means all identifier references have been resolved.)

From this structure we can generate the corresponding code in some other language, the target language. This is what a code generator does.

Some compilers generate twice: they first generate code in some “intermediate language” like SIL, LLVM IR, HIR, MIR, CIL, etc. Then they do the “real” code generation into a target language that is directly runnable (or really close to it), like virtual machine code, assembly language, or machine language.

Goals of Code Generation

Here are the three goals of any good code generator:

Correctness

It should produce correct code

Efficiency

It should produce efficient code

Self-Efficiency

It should itself be efficient

Exercise: The last two goals are seemingly in conflict. Discuss.

What Kind of Code Can We Generate?

There is no one way to do code generation. There is no one target of a code generator. A code generator can produce:

Code in another high-level language. Don’t knock it. If the target high-level language has a good compiler already, this is a great strategy. Stand on the shoulders of giants right?
C. Oh wait, C is a high-level language. Sort of. Wait no. No lambdas. But then you can let the C compiler take over to do the rest.
An intermediate representation like LLVM, that has backends for dozens of architectures.
Executable virtual machine code, such as for the JVM or BEAM. Targeting a virtual machine is a ton of fun. Learning about a VM and its instruction set is a great experience. You’ll get a great feeling of accomplishment. And you can run this code, too.
Assembly language: Because it’s so fulfilling to learn about machine architecture, especially pipelines, caches, and of course, registers! You need to know your machine. Is it a RISC or CISC processor or some kind of novel architecture like a dataflow processor or somthing with hundreds of cores? What are the relative instruction speeds? How are its addressing modes? Does it have out-of-order instruction? Any cool idioms?
Relocatable machine language: you’ve done a lot of assembly that the assembler would have done for you (resolving symbols), but at least you will let the linker do the rest.
Absolute machine language: you’ve resolved all the memory locations; in other words you spent a lot of time doing what assemblers and linkers would do for you already.

Are there other kinds of code?

Two More Questions

Also consider:

Should we generate awesome code on the fly? Or let a separate optimizing pass clean up after our generator? Or not optimize at all? For example, naive code generation can lead to not so great code:

  x = x + 1;      mov rax, x_0    ⎫
                  add rax, 1      ⎬	just need inc x_0
                  mov x_0, rax    ⎭

  x = y + c;      mov rax, y_5
  z = x + 4;      add rax, c_8
                  mov x_2, rax    ; Redundant if x not used again
                  mov rax, x_2    ; Redundant
                  add rax, 4
                  mov z_7, rax

Should we stick in line number information so a debugger can take over if our program is stopped intentionally (or crashes)?

Transpiling

JavaScript makes a great target language? Why? It’s got first-class functions and async support, and that Node.js thing runs on that awesome V8. And the ecosystem! How can we target JS?

Start by browsing an existing example. This one, for the language Carlos, translates the decorated AST (Semantic Graph) directly into JavaScript (there’s no need to use an IR if you are targeting a high-level language):

generator.js

// The code generator exports a single function, generate(program), which
// accepts a program representation and returns the JavaScript translation
// as a string.

import { voidType, standardLibrary } from "./core.js"

export default function generate(program) {
  // When generating code for statements, we'll accumulate the lines of
  // the target code here. When we finish generating, we'll join the lines
  // with newlines and return the result.
  const output = []

  // Variable and function names in JS will be suffixed with _1, _2, _3,
  // etc. This is because "switch", for example, is a legal name in Carlos,
  // but not in JS. So, the Carlos variable "switch" must become something
  // like "switch_1". We handle this by mapping each name to its suffix.
  const targetName = (mapping => {
    return entity => {
      if (!mapping.has(entity)) {
        mapping.set(entity, mapping.size + 1)
      }
      return `${entity.name}_${mapping.get(entity)}`
    }
  })(new Map())

  const gen = node => generators?.[node?.kind]?.(node) ?? node

  const generators = {
    // Key idea: when generating an expression, just return the JS string; when
    // generating a statement, write lines of translated JS to the output array.
    Program(p) {
      p.statements.forEach(gen)
    },
    VariableDeclaration(d) {
      // We don't care about const vs. let in the generated code! The analyzer has
      // already checked that we never updated a const, so let is always fine.
      output.push(`let ${gen(d.variable)} = ${gen(d.initializer)};`)
    },
    TypeDeclaration(d) {
      // The only type declaration in Carlos is the struct! Becomes a JS class.
      output.push(`class ${gen(d.type)} {`)
      output.push(`constructor(${d.type.fields.map(gen).join(",")}) {`)
      for (let field of d.type.fields) {
        output.push(`this[${JSON.stringify(gen(field))}] = ${gen(field)};`)
      }
      output.push("}")
      output.push("}")
    },
    StructType(t) {
      return targetName(t)
    },
    Field(f) {
      return targetName(f)
    },
    FunctionDeclaration(d) {
      output.push(`function ${gen(d.fun)}(${d.fun.params.map(gen).join(", ")}) {`)
      d.fun.body.forEach(gen)
      output.push("}")
    },
    Variable(v) {
      // Standard library constants get special treatment
      if (v === standardLibrary.π) return "Math.PI"
      return targetName(v)
    },
    Function(f) {
      return targetName(f)
    },
    Increment(s) {
      output.push(`${gen(s.variable)}++;`)
    },
    Decrement(s) {
      output.push(`${gen(s.variable)}--;`)
    },
    Assignment(s) {
      output.push(`${gen(s.target)} = ${gen(s.source)};`)
    },
    BreakStatement(s) {
      output.push("break;")
    },
    ReturnStatement(s) {
      output.push(`return ${gen(s.expression)};`)
    },
    ShortReturnStatement(s) {
      output.push("return;")
    },
    IfStatement(s) {
      output.push(`if (${gen(s.test)}) {`)
      s.consequent.forEach(gen)
      if (s.alternate?.kind?.endsWith?.("IfStatement")) {
        output.push("} else")
        gen(s.alternate)
      } else {
        output.push("} else {")
        s.alternate.forEach(gen)
        output.push("}")
      }
    },
    ShortIfStatement(s) {
      output.push(`if (${gen(s.test)}) {`)
      s.consequent.forEach(gen)
      output.push("}")
    },
    WhileStatement(s) {
      output.push(`while (${gen(s.test)}) {`)
      s.body.forEach(gen)
      output.push("}")
    },
    RepeatStatement(s) {
      // JS can only repeat n times if you give it a counter variable!
      const i = targetName({ name: "i" })
      output.push(`for (let ${i} = 0; ${i} < ${gen(s.count)}; ${i}++) {`)
      s.body.forEach(gen)
      output.push("}")
    },
    ForRangeStatement(s) {
      const i = targetName(s.iterator)
      const op = s.op === "..." ? "<=" : "<"
      output.push(`for (let ${i} = ${gen(s.low)}; ${i} ${op} ${gen(s.high)}; ${i}++) {`)
      s.body.forEach(gen)
      output.push("}")
    },
    ForStatement(s) {
      output.push(`for (let ${gen(s.iterator)} of ${gen(s.collection)}) {`)
      s.body.forEach(gen)
      output.push("}")
    },
    Conditional(e) {
      return `((${gen(e.test)}) ? (${gen(e.consequent)}) : (${gen(e.alternate)}))`
    },
    BinaryExpression(e) {
      if (e.op === "hypot") return `Math.hypot(${gen(e.left)},${gen(e.right)})`
      const op = { "==": "===", "!=": "!==" }[e.op] ?? e.op
      return `(${gen(e.left)} ${op} ${gen(e.right)})`
    },
    UnaryExpression(e) {
      const operand = gen(e.operand)
      if (e.op === "some") return operand
      if (e.op === "#") return `${operand}.length`
      if (e.op === "random") return `((a=>a[~~(Math.random()*a.length)])(${operand}))`
      if (e.op === "codepoints") return `[...(${operand})].map(s=>s.codePointAt(0))`
      if (e.op === "bytes") return `[...Buffer.from(${operand}, "utf8")]`
      if (e.op === "sqrt") return `Math.sqrt(${operand})`
      if (e.op === "sin") return `Math.sin(${operand})`
      if (e.op === "cos") return `Math.cos(${operand})`
      if (e.op === "exp") return `Math.exp(${operand})`
      if (e.op === "ln") return `Math.log(${operand})`
      return `${e.op}(${operand})`
    },
    EmptyOptional(e) {
      return "undefined"
    },
    SubscriptExpression(e) {
      return `${gen(e.array)}[${gen(e.index)}]`
    },
    ArrayExpression(e) {
      return `[${e.elements.map(gen).join(",")}]`
    },
    EmptyArray(e) {
      return "[]"
    },
    MemberExpression(e) {
      const object = gen(e.object)
      const field = JSON.stringify(gen(e.field))
      const chain = e.op === "." ? "" : e.op
      return `(${object}${chain}[${field}])`
    },
    FunctionCall(c) {
      const targetCode = `${gen(c.callee)}(${c.args.map(gen).join(", ")})`
      // Calls in expressions vs in statements are handled differently
      if (c.callee.type.returnType !== voidType) {
        return targetCode
      }
      output.push(`${targetCode};`)
    },
    ConstructorCall(c) {
      return `new ${gen(c.callee)}(${c.args.map(gen).join(", ")})`
    },
    Print(s) {
      output.push(`console.log(${s.args.map(gen).join(", ")});`)
    },
  }

  gen(program)
  return output.join("\n")
}

What did you notice from browsing these examples? A few tips and tricks, right? Things like:

You write a generate method for each AST node, and each node’s generate method invokes the generate method of its sub nodes. The top-level function gen dispatches to the right generate method
When writing out high-level code, it is good to over-parenthesize: you should probably not assume your associativities and precedences and arities and fixities are the same as those of your target language.
We should decorate the ids in the target language with uniquely identifying suffixes, because (1) the set of reserved words may differ between source and target languages, and (2) you never know whether scope rules are the same!
While not shown in the examples above, character sets may very well be different between the two languages; keep that in mind.
Inline some standard functions here. (But where standard functions cannot be inlined, your code generator can always write out the function implementation in the target language.)
If your source language has static type checking, static access checks, static matching checks, static read-only checks, safe optionals, etc., these have already been done in the analyzer, and do not need to be handled in the generator! The Carlos generator above is an example of this.
However, if your source language is dynamic, then you must generate code to do the type, access, read-only, null, and matching checks!
Even for very static languages, some checks will still have to be done in generated code, like array length checks (when the array is a variable whose size is unknown until run-time)
There may be a ton of ad-hoc things that pop up, but these are fun challenges.

Here is a view of some of the Carlos functions in table form:

Graph Fragment	JavaScript
program / \ s₁ .. s_n	s₁'; . . s_n';
vardecl / \| \ v T E	let v' = E'
structtypedecl / \ t structtype / \ field .. field / \ / \ x₁ T₁ x_n T_n	class t' { constructor (x₁', .., x_n') { this.x₁' = x₁'; ... this.x_n' = x_n'; } }
functiondecl / \| \ function T B / \| \ f param .. param / \ / \ x₁ T₁ x_n T_n	function f(x₁', .., x_n') { B }
inc \| V	V'++
dec \| V	V'--
assign / \ V E	V' = E'
break	break
return	return
return \| E	return E'
if / \ E B	if (E') { B' }
if / \| \ E B₁ B₂	if (E') { B₁' } else { B₂'} // But no braces around B₂ if it’s an if-statement
while / \ E B	while (E') { B' }
repeat / \ E B	for (let i = 0; i < E'; i++) { B' } // (note: i is a new variable)
for-upto / \| \| \ x E₁ E₂ B	for (let x' = E₁'; x' < E₂'; x'++) { B' }
for-through / \| \| \ x E₁ E₂ B	for (let x' = E₁'; x' <= E₂'; x'++) { B' }
for-collection / \| \ x A B	for (let x' of A') { B' }
conditional / \| \ E E₁ E₂	((E') ? (E₁') : (E₂'))
+ / \ E₁ E₂	(E₁' + E₂) // same for all other binary ops // translated where necessary (e.g. == → ===)
- \| E	(- E') // same for all other unary ops // except "some", which is a no-op in JS
emptyoptional	undefined
subscript / \ A E	A'[E']
arrayexpression / \ E₁ ..... E_n	[E₁', ...., E_n']
member / \ O F	O'["F'"] // quotes are important
fn-call / \| \ F E₁ .. E_n	F'(E₁', ...., E_n') // Standard Library functions handled inline
ctor-call / \| \ C E₁ .. E_n	new C'(E₁', ...., E_n')

Virtual Machines and Backend Language Libraries

LLVM

LLVM is, well, it’s huge, it’s impressive, and lots of compilers for modern languages are using it. It is a library with a ton of stuff, including an intermediate language and backends. It’s official name is The LLVM Compiler Infrastructure Project.

A take on modern compiler construction from Andi McClure:

Exercise: Read the whole blog post from whence came that awesome picture.

JVM

Read about the JVM at Wikipedia and about its formal specification.

Web Assembly

Read about Web Assembly

Exercise: Explain WASM, briefly, in your own words.

Write your own VM!

SERIOUSLY. DO IT.

Targeting a Typical Processor

Okay let’s say you’re going to bypass LLVM or anything else and you are going straight to assembly. Or, let’s say you are joining the LLVM team and want to enhance the project. Or you want to write a new backend for an intermediate representation. Lots of good stuff to know. Here are some notes that just scratch the surface of generating assembly language.

Address Assignment

TODO - Fortran no recursion local vars, even return val on stack

TODO - C, show assembly for recursive function

TODO - C, show space for local vars

TODO - motivate stack frames

TODO - motivate dynamic links

TODO - Three approaches to static scope

TODO - Static link

TODO - Single display

TODO - Display in every frame

TODO - No need for display or static link in leaf subroutine

TODO - Who saves the registers

TODO - Inlining?

TODO - Summary of questions and concerns

Instruction Selection

TODO

Register Allocation

TODO

Code Generator Generators

TODO

Summary

We’ve covered: