The first part of a compiler analyzes the source code into a structure that carries the meaning of the program; this structure is generally the abstract syntax tree that’s been checked and decorated. (Remember decorated means all identifier references have been resolved.)
From this structure we can generate the corresponding code in some other language, the target language. This is what a code generator does.
Some compilers generate twice: they first generate code in some “intermediate language” like SIL, LLVM IR, HIR, MIR, CIL, etc. Then they do the “real” code generation into a target language that is directly runnable (or really close to it), like virtual machine code, assembly language, or machine language.
Here are the three goals of any good code generator:
It should produce correct code
It should produce efficient code
It should itself be efficient
There is no one way to do code generation. There is no one target of a code generator. A code generator can produce:
Are there other kinds of code?
Also consider:
x = x + 1; mov rax, x_0 ⎫ add rax, 1 ⎬ just need inc x_0 mov x_0, rax ⎭ x = y + c; mov rax, y_5 z = x + 4; add rax, c_8 mov x_2, rax ; Redundant if x not used again mov rax, x_2 ; Redundant add rax, 4 mov z_7, rax
JavaScript makes a great target language? Why? It’s got first-class functions and async support, and that Node.js thing runs on that awesome V8. And the ecosystem! How can we target JS?
Start by browsing an existing example. This one, for the language Carlos, translates the decorated AST (Semantic Graph) directly into JavaScript (there’s no need to use an IR if you are targeting a high-level language):
// The code generator exports a single function, generate(program), which // accepts a program representation and returns the JavaScript translation // as a string. import { voidType, standardLibrary } from "./core.js" export default function generate(program) { // When generating code for statements, we'll accumulate the lines of // the target code here. When we finish generating, we'll join the lines // with newlines and return the result. const output = [] const standardFunctions = new Map([ [standardLibrary.print, x => `console.log(${x})`], [standardLibrary.sin, x => `Math.sin(${x})`], [standardLibrary.cos, x => `Math.cos(${x})`], [standardLibrary.exp, x => `Math.exp(${x})`], [standardLibrary.ln, x => `Math.log(${x})`], [standardLibrary.hypot, ([x, y]) => `Math.hypot(${x},${y})`], [standardLibrary.bytes, s => `[...Buffer.from(${s}, "utf8")]`], [standardLibrary.codepoints, s => `[...(${s})].map(s=>s.codePointAt(0))`], ]) // Variable and function names in JS will be suffixed with _1, _2, _3, // etc. This is because "switch", for example, is a legal name in Carlos, // but not in JS. So, the Carlos variable "switch" must become something // like "switch_1". We handle this by mapping each name to its suffix. const targetName = (mapping => { return entity => { if (!mapping.has(entity)) { mapping.set(entity, mapping.size + 1) } return `${entity.name}_${mapping.get(entity)}` } })(new Map()) const gen = node => generators?.[node?.kind]?.(node) ?? node const generators = { // Key idea: when generating an expression, just return the JS string; when // generating a statement, write lines of translated JS to the output array. Program(p) { p.statements.forEach(gen) }, VariableDeclaration(d) { // We don't care about const vs. let in the generated code! The analyzer has // already checked that we never updated a const, so let is always fine. output.push(`let ${gen(d.variable)} = ${gen(d.initializer)};`) }, TypeDeclaration(d) { // The only type declaration in Carlos is the struct! Becomes a JS class. output.push(`class ${gen(d.type)} {`) output.push(`constructor(${d.type.fields.map(gen).join(",")}) {`) for (let field of d.type.fields) { output.push(`this[${JSON.stringify(gen(field))}] = ${gen(field)};`) } output.push("}") output.push("}") }, StructType(t) { return targetName(t) }, Field(f) { return targetName(f) }, FunctionDeclaration(d) { output.push(`function ${gen(d.fun)}(${d.params.map(gen).join(", ")}) {`) d.body.forEach(gen) output.push("}") }, Variable(v) { // Standard library constants just get special treatment if (v === standardLibrary.π) return "Math.PI" return targetName(v) }, Function(f) { return targetName(f) }, Increment(s) { output.push(`${gen(s.variable)}++;`) }, Decrement(s) { output.push(`${gen(s.variable)}--;`) }, Assignment(s) { output.push(`${gen(s.target)} = ${gen(s.source)};`) }, BreakStatement(s) { output.push("break;") }, ReturnStatement(s) { output.push(`return ${gen(s.expression)};`) }, ShortReturnStatement(s) { output.push("return;") }, IfStatement(s) { output.push(`if (${gen(s.test)}) {`) s.consequent.forEach(gen) if (s.alternate?.kind?.endsWith?.("IfStatement")) { output.push("} else") gen(s.alternate) } else { output.push("} else {") s.alternate.forEach(gen) output.push("}") } }, ShortIfStatement(s) { output.push(`if (${gen(s.test)}) {`) s.consequent.forEach(gen) output.push("}") }, WhileStatement(s) { output.push(`while (${gen(s.test)}) {`) s.body.forEach(gen) output.push("}") }, RepeatStatement(s) { // JS can only repeat n times if you give it a counter variable! const i = targetName({ name: "i" }) output.push(`for (let ${i} = 0; ${i} < ${gen(s.count)}; ${i}++) {`) s.body.forEach(gen) output.push("}") }, ForRangeStatement(s) { const i = targetName(s.iterator) const op = s.op === "..." ? "<=" : "<" output.push(`for (let ${i} = ${gen(s.low)}; ${i} ${op} ${gen(s.high)}; ${i}++) {`) s.body.forEach(gen) output.push("}") }, ForStatement(s) { output.push(`for (let ${gen(s.iterator)} of ${gen(s.collection)}) {`) s.body.forEach(gen) output.push("}") }, Conditional(e) { return `((${gen(e.test)}) ? (${gen(e.consequent)}) : (${gen(e.alternate)}))` }, BinaryExpression(e) { const op = { "==": "===", "!=": "!==" }[e.op] ?? e.op return `(${gen(e.left)} ${op} ${gen(e.right)})` }, UnaryExpression(e) { const operand = gen(e.operand) if (e.op === "some") { return operand } else if (e.op === "#") { return `${operand}.length` } else if (e.op === "random") { return `((a=>a[~~(Math.random()*a.length)])(${operand}))` } return `${e.op}(${operand})` }, EmptyOptional(e) { return "undefined" }, SubscriptExpression(e) { return `${gen(e.array)}[${gen(e.index)}]` }, ArrayExpression(e) { return `[${e.elements.map(gen).join(",")}]` }, EmptyArray(e) { return "[]" }, MemberExpression(e) { const object = gen(e.object) const field = JSON.stringify(gen(e.field)) const chain = e.op === "." ? "" : e.op return `(${object}${chain}[${field}])` }, FunctionCall(c) { const targetCode = standardFunctions.has(c.callee) ? standardFunctions.get(c.callee)(c.args.map(gen)) : `${gen(c.callee)}(${c.args.map(gen).join(", ")})` // Calls in expressions vs in statements are handled differently if (c.callee.type.returnType !== voidType) { return targetCode } output.push(`${targetCode};`) }, ConstructorCall(c) { return `new ${gen(c.callee)}(${c.args.map(gen).join(", ")})` }, } gen(program) return output.join("\n") }
What did you notice from browsing these examples? A few tips and tricks, right? Things like:
generate
method for each AST node, and each node’s generate
method invokes the generate method of its sub nodes. The top-level function gen
dispatches to the right generate methodHere is a view of some of the Carlos functions in table form:
Graph Fragment | JavaScript |
---|---|
program / \ s1 .. sn |
s1'; . . sn'; |
vardecl / | \ v T E |
let v' = E' |
structtypedecl / \ t structtype / \ field .. field / \ / \ x1 T1 xn Tn |
class t' { constructor (x1', .., xn') { this.x1' = x1'; ... this.xn' = xn'; } } |
functiondecl / | \ function T B / | \ f param .. param / \ / \ x1 T1 xn Tn |
function f(x1', .., xn') { B } |
inc | V |
V'++ |
dec | V |
V'-- |
assign / \ V E |
V' = E' |
break |
break |
return |
return |
return | E |
return E' |
if / \ E B |
if (E') { B' } |
if / | \ E B1 B2 |
if (E') { B1' } else { B2'} // But no braces around B2 if it’s an if-statement |
while / \ E B |
while (E') { B' } |
repeat / \ E B |
for (let i = 0; i < E'; i++) { B' } // (note: i is a new variable) |
for-upto / | | \ x E1 E2 B |
for (let x' = E1'; x' < E2'; x'++) { B' } |
for-through / | | \ x E1 E2 B |
for (let x' = E1'; x' <= E2'; x'++) { B' } |
for-collection / | \ x A B |
for (let x' of A') { B' } |
conditional / | \ E E1 E2 |
((E') ? (E1') : (E2')) |
+ / \ E1 E2 |
(E1' + E2) // same for all other binary ops // translated where necessary (e.g. == → ===) |
- | E |
(- E') // same for all other unary ops // except "some", which is a no-op in JS |
emptyoptional |
undefined |
subscript / \ A E |
A'[E'] |
arrayexpression / \ E1 ..... En |
[E1', ...., En'] |
member / \ O F |
O'["F'"] // quotes are important |
fn-call / | \ F E1 .. En |
F'(E1', ...., En') // Standard Library functions handled inline |
ctor-call / | \ C E1 .. En |
new C'(E1', ...., En') |
LLVM is, well, it’s huge, it’s impressive, and lots of compilers for modern languages are using it. It is a library with a ton of stuff, including an intermediate language and backends. It’s official name is The LLVM Compiler Infrastructure Project.
A take on modern compiler construction from Andi McClure:
Read about the JVM at Wikipedia and about its formal specification.
Read about Web Assembly
SERIOUSLY. DO IT.
Okay let’s say you’re going to bypass LLVM or anything else and you are going straight to assembly. Or, let’s say you are joining the LLVM team and want to enhance the project. Or you want to write a new backend for an intermediate representation. Lots of good stuff to know. Here are some notes that just scratch the surface of generating assembly language.
TODO - Fortran no recursion local vars, even return val on stack
TODO - C, show assembly for recursive function
TODO - C, show space for local vars
TODO - motivate stack frames
TODO - motivate dynamic links
TODO - Three approaches to static scope
TODO - Static link
TODO - Single display
TODO - Display in every frame
TODO - No need for display or static link in leaf subroutine
TODO - Who saves the registers
TODO - Inlining?
TODO - Summary of questions and concerns
TODO
TODO
TODO
We’ve covered: