Introduction to Code Generation
What about the back half of the compiler?
What is Code Generation?
The first part of a compiler analyzes the source code into a structure that carries the meaning of the program; this structure is generally the abstract syntax tree that’s been checked and decorated. (Remember decorated means all identifier references have been resolved.)
From this structure we can generate the corresponding code in some other language, the target language. This is what a code generator does.
Some compilers generate twice: they first generate code in some “intermediate language” like SIL, LLVM IR, HIR, MIR, CIL, etc. Then they do the “real” code generation into a target language that is directly runnable (or really close to it), like virtual machine code, assembly language, or machine language.
Goals of Code Generation
Here are the three goals of any good code generator:
- It should produce correct code.
- It should produce efficient code.
- It should itself be efficient.
Exercise: The last two goals are seemingly in conflict. Discuss.
What Kind of Code Can We Generate?
There is no one way to do code generation. There is no one target of a code generator. A code generator can produce:
- Code in another high-level language. Don’t knock it. If the target high-level language has a good compiler already, this is a great strategy. Stand on the shoulders of giants right?
- C. Oh wait, C is a high-level language. Sort of. Wait no. No lambdas. But then you can let the C compiler take over to do the rest.
- An intermediate representation like LLVM, that has backends for dozens of architectures.
- Executable virtual machine code, such as for the JVM or BEAM. Targeting a virtual machine is a ton of fun. Learning about a VM and its instruction set is a great experience. You’ll get a great feeling of accomplishment. And you can run this code, too.
- Assembly language: Because it’s so fulfilling to learn about machine architecture, especially pipelines, caches, and of course, registers! You need to know your machine. Is it a RISC or CISC processor or some kind of novel architecture like a dataflow processor or somthing with hundreds of cores? What are the relative instruction speeds? How are its addressing modes? Does it have out-of-order instruction? Any cool idioms?
- Relocatable machine language: you’ve done a lot of assembly that the assembler would have done for you (resolving symbols), but at least you will let the linker do the rest.
- Absolute machine language: you’ve resolved all the memory locations; in other words you spent a lot of time doing what assemblers and linkers would do for you already.
Are there other kinds of code?
- Should we generate awesome code on the fly? Or let a separate optimizing pass clean up after our generator? Or not optimize at all? For example, naive code generation can lead to not so great code:
x = x + 1; mov rax, x_0 ⎫
add rax, 1 ⎬ just need inc x_0
mov x_0, rax ⎭
x = y + c; mov rax, y_5
z = x + 4; add rax, c_8
mov x_2, rax ; Redundant if x not used again
mov rax, x_2 ; Redundant
add rax, 4
mov z_7, rax
- Should we stick in line number information so a debugger can take over if our program is stopped intentionally (or crashes)?
Transpiling
JavaScript makes a great target language? Why? It’s got first-class functions and async support, and that Node.js thing runs on that awesome V8. And the ecosystem! How can we target JS?
Start by browsing code of simple examples. Each start with a decorated AST (there’s pretty much no need to use an IR if you are targeting a high-level language):
What did you notice from browsing these examples? A few tips and tricks, right? Things like:
- You write a
generate
method for each AST node.
- Each node’s generate method invokes the generate method of its sub nodes.
- When writing out high-level code, it is fine to over-parenthesize: you should probably not assume your associativities and precedences and arities and fixities are the same as those of your target language.
- It’s best to decorate the ids in the target language with uniquely identifying suffixes, because you never know whether scope rules are the same.
- While not shown in the examples above, character sets may very well be different between the two languages; keep that in mind.
- The code generator is a good place to inline some standard functions.
- But where standard functions cannot be inlined, your code generator can always write out the function implementation in the target language.
- It’s rather pleasant to use a very weakly typed, dynamic language for the target, so you don’t have to worry about constructing types in the target language. In fact if your source code language is statically typed, great! Do all the typechecking in your front end, then throw away all the types when you do your backend.
- There may be a ton of ad-hoc things that pop up, but these are fun challenges.
Virtual Machines and Backend Language Libraries
LLVM
LLVM is, well, it’s huge, it’s impressive, and lots of compilers for modern languages are using it. It is a library with a ton of stuff, including an intermediate language and backends. It’s official name is The LLVM Compiler Infrastructure Project.
A take on modern compiler construction from Andi McClure:

Exercise: Read the whole blog post from whence came that awesome picture.
JVM
Read about the JVM at Wikipedia and about its formal specification.
Web Assembly
Read about Web Assembly
Exercise: Explain WASM, briefly, in your own words.
Write your own VM!
SERIOUSLY. DO IT.
Targeting a Typical Processor
Okay let’s say you’re going to bypass LLVM or anything else and you are going straight to assembly. Or, let’s say you are joining the LLVM team and want to enhance the project. Or you want to write a new backend for an intermediate representation. Lots of good stuff to know. Here are some notes that just scratch the surface of generating assembly language.
Address Assignment
TODO - Fortran no recursion local vars, even return val on stack
TODO - C, show assembly for recursive function
TODO - C, show space for local vars
TODO - motivate stack frames
TODO - motivate dynamic links
TODO - Three approaches to static scope
TODO - Static link
TODO - Single display
TODO - Display in every frame
TODO - No need for display or static link in leaf subroutine
TODO - Who saves the registers
TODO - Inlining?
TODO - Summary of questions and concerns
Instruction Selection
TODO