A program is a linguistic representation of a computation:
An interpreter is a program whose input is a program $P$ and some input data $x$; it executes $P$ on $x$:
That is, $I(P,x) = P(x)$.
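To make the equation concrete, here is a tiny JavaScript sketch (names invented just for illustration) in which a “program” is nothing more than a function value, so interpreting it amounts to applying it to its input:

// A toy illustration of I(P, x) = P(x): the "program" P is a JavaScript
// function value, and the interpreter simply applies it to the input x.
const run = (P, x) => P(x)

const square = n => n * n     // a tiny "program"
console.log(run(square, 7))   // 49, the same as square(7)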
Interpreters can be implemented in hardware or software. For example, a CPU is a (hardware) interpreter for machine language programs.
The Universal Interpreter 🤯🤯🤯
The fact that there exists a single interpreter capable of executing all possible programs (those representable in a formal system) is a very deep and significant discovery of Turing’s. It was, and perhaps still is, utterly mind-blowing.
It’s too expensive (not to mention ridiculously difficult) to write interpreters for high-level languages, so we generally translate them into something more easily interpretable. A translator is a program which takes as input a program in some language $S$ and outputs a program in a language $T$ such that the two programs have the same semantics.
That is, $\forall x.\,P(x) = Q(x)$.
Some translators have really strange names:
If we translate the entirety of a program into something interpretable, we call this ahead-of-time (AOT) compilation. AOT compilers may be chained, the last of which is often an existing assembler, like this:
When we compile portions of a program while other previously compiled portions are running, we have just-in-time (JIT) compilation. JIT compilers will often remember what they already compiled instead of compiling the same source over and over again. They can even do adaptive compilation and recompilation based on runtime behavior.
It gets even more interesting
It actually isn’t true that compilation, whether AOT or JIT, is completely outside of your program. Many languages allow you to execute code at compile time and compile new code at run time: think of macros in Julia or Clojure, Perl’s BEGIN block, or any language with the dreaded eval.
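Here is a tiny JavaScript sketch of the run-time half of this idea: we build source code as a string while the program is running and let the engine compile it on the spot with the Function constructor (a close cousin of eval); the helper name is made up for this example:

// Compiling new code at run time: the body of the function is assembled
// as a string, then handed to the JavaScript engine to compile.
const makeAdder = n => new Function('x', `return x + ${n}`)

const addFive = makeAdder(5)   // this function did not exist until run time
console.log(addFive(10))       // 15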
Generally we write interpreters for simple, or low-level languages, whose programs tend to look like simple instruction sequences. As an example, let’s imagine a simple programming language for evaluating and printing the results of arithmetic expressions. Here are three example programs:
8 P
2 3 + N P
3 1 + P 0 15 * 100 25 2 * / P P
When you see a number, you push it on a stack. N negates the number on top. +, -, *, and / pop the top two elements (the first value popped is the right operand), apply the operation, and push the result. P pops the top value and prints it.
Before we implement the interpreter, execute the three programs above by hand so you understand the operation.
You did the classwork, right? Great! Now let’s write an interpreter:
function interpret(program) {
  const stack = []
  function push(x) {
    stack.push(Number(x))
  }
  function pop() {
    if (stack.length === 0) throw 'Not enough operands'
    return stack.pop()
  }
  for (const token of program.trim().split(/\s+/)) {
    let x, y
    if (/^\d+(\.\d+)?$/.test(token)) push(token)
    else if (token === 'N') (x = pop()), push(-x)
    else if (token === '+') (y = pop()), (x = pop()), push(x + y)
    else if (token === '-') (y = pop()), (x = pop()), push(x - y)
    else if (token === '*') (y = pop()), (x = pop()), push(x * y)
    else if (token === '/') (y = pop()), (x = pop()), push(x / y)
    else if (token === 'P') (x = pop()), console.log(x)
    else throw 'Illegal Instruction'
  }
}

Since the interpreter is a JavaScript program, you can just copy-paste it into your browser’s console and invoke the function on any “program” of your choice, for example interpret("8 P").
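Since interpretation and translation are easy to confuse, here is a sketch of a translator for the very same little language: instead of executing the program, it emits an equivalent JavaScript program as a string (the function name and the shape of the generated code are just choices made for this illustration):

// A translator sketch for the stack language: for each token we emit a
// line of JavaScript rather than executing anything, and return the whole
// target program as one string.
function translate(program) {
  const lines = ['const stack = []']
  for (const token of program.trim().split(/\s+/)) {
    if (/^\d+(\.\d+)?$/.test(token)) lines.push(`stack.push(${token})`)
    else if (token === 'N') lines.push('stack.push(-stack.pop())')
    else if (['+', '-', '*', '/'].includes(token))
      lines.push(`{ const y = stack.pop(), x = stack.pop(); stack.push(x ${token} y) }`)
    else if (token === 'P') lines.push('console.log(stack.pop())')
    else throw 'Illegal Instruction'
  }
  return lines.join('\n')
}

Calling translate("2 3 + N P") returns a JavaScript program that, when run, prints -5, exactly what interpret("2 3 + N P") prints. Same semantics, different time of execution.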
When thinking about more sophisticated languages, we generally go through layers of translation before getting things down to something simple enough to interpret. That’s fine, but we need to think deeply about what translation even is before attempting to implement a translator.
First something inescapably true: translation consists of analysis and generation phases.
There are two reasons for this:
In practice it is rare for a conceptual representation to exist that is expressive enough and powerful enough to capture every imaginable source language and every imaginable target. Some systems do manage, at least, to get a lot of the common ones.
How successful is machine translation for natural languages? Might it work best when the two languages are close cousins or siblings in an evolutionary language tree? Whether or not that’s the case, it’s often illuminating to play translation telephone. Here is one run I made using Google Translate:
With idioms, sayings, and slang, you only need two languages to get something funny (the following really happened on Google Translate, but has since been fixed):
“Screenshot or it didn’t happen”:
Machine translation for natural languages is getting much better. Translation systems use statistical methods, machine learning, dictionaries, and lots of other interesting techniques. We aim for great translations, because perfection is not always possible.
Fun read: Enjoy this article from December 2016 about issues in natural language translation, and about when computers started to get really good at it.
Programming languages tend to be free of the ambiguities and nuances of natural languages, so translations can, in most cases, be 100% precise, and even formally verified. How do these systems look in practice? Decades of research into translation techniques have yielded some useful knowledge.
Compilers for programming languages go through a lot of phases; some have several dozen. In the so-called “classic compiler” (a standalone compiler that does everything from scratch), you tend to see the following major phases:
What do these phases look like in practice? Here’s a super-simplified example, just to get a taste for what is going on, using this C code:
#define ZERO 0
unsigned gcd( unsigned int   // Euclid’s algorithm
    x,unsigned y) { while ( /* hello */ x> ZERO
  ){unsigned oldx=x;x=y %x;y = oldx ;}return y ;}
The lexer converts the source character stream into tokens. In doing this it needs to do things like remove comments, expand source-level macros, deal with indentation (for whitespace-significant languages like Python and Haskell), and remove whitespace. For the C code above, the lexical analyzer produces the token sequence:
unsigned
ID(gcd)
(
unsigned
int
ID(x)
,
unsigned
ID(y)
)
{
while
(
ID(x)
>
INTLIT(0)
)
{
unsigned
ID(oldx)
=
ID(x)
;
ID(x)
=
ID(y)
%
ID(x)
;
ID(y)
=
ID(oldx)
;
}
return
ID(y)
;
}
To build a token stream, lexers have to take into account case sensitivity, whitespace and newline significance, comment nesting, the actual source language alphabet, and any weird limits on identifier lengths. They need to be able to detect whether character or string literals are unclosed, or whether a file ends while inside a comment. If a token stream cannot be formed, the lexer emits a lexical error.
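To get a feel for the job, here is a highly simplified, hand-rolled lexer sketch in JavaScript. It knows only about identifiers, integer literals, a handful of punctuation characters, a few keywords, and line comments, and it does no error detection at all; real lexers do far more:

// A tiny lexer sketch: produces tokens such as ID(x) and INTLIT(0),
// skipping whitespace and line comments along the way.
const keywords = new Set(['unsigned', 'int', 'while', 'return'])

function lex(source) {
  const tokens = []
  const pattern = /\/\/[^\n]*|[A-Za-z_]\w*|\d+|[(){};,%>=]/g
  for (const lexeme of source.match(pattern) ?? []) {
    if (lexeme.startsWith('//')) continue                        // drop comments
    if (keywords.has(lexeme)) tokens.push(lexeme)                // keywords stand for themselves
    else if (/^\d+$/.test(lexeme)) tokens.push(`INTLIT(${lexeme})`)
    else if (/^[A-Za-z_]/.test(lexeme)) tokens.push(`ID(${lexeme})`)
    else tokens.push(lexeme)                                     // punctuation
  }
  return tokens
}

console.log(lex('while (x > 0) { x = y % x; }'))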
Next, the parser applies the grammar rules to the token sequence to produce the abstract syntax tree. In our running example, the AST is:
Parsers, like lexers, are often automatically generated from a grammar by special tools, called parser generators, but you can write them by hand if needed.
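For a taste of the hand-written kind, here is a sketch of a recursive-descent parser for simple arithmetic expressions that builds an AST out of plain JavaScript objects (the node shapes here are just one possible design):

// A recursive-descent parser sketch for expressions like "3 + 4 * x".
// Grammar:  Exp -> Term (("+"|"-") Term)*
//           Term -> Factor (("*"|"/") Factor)*
//           Factor -> NUMBER | ID | "(" Exp ")"
function parse(source) {
  const tokens = source.match(/\d+(\.\d+)?|[A-Za-z_]\w*|[-+*/()]/g) ?? []
  let position = 0
  const at = expected => tokens[position] === expected && (position++, true)

  function parseExp() {
    let node = parseTerm()
    while (tokens[position] === '+' || tokens[position] === '-')
      node = { op: tokens[position++], left: node, right: parseTerm() }
    return node
  }
  function parseTerm() {
    let node = parseFactor()
    while (tokens[position] === '*' || tokens[position] === '/')
      node = { op: tokens[position++], left: node, right: parseFactor() }
    return node
  }
  function parseFactor() {
    const token = tokens[position++]
    if (token === undefined) throw 'Unexpected end of input'
    if (/^\d/.test(token)) return { number: Number(token) }
    if (/^[A-Za-z_]/.test(token)) return { id: token }
    if (token === '(') {
      const node = parseExp()
      if (!at(')')) throw 'Expected )'
      return node
    }
    throw `Unexpected token ${token}`
  }

  const tree = parseExp()
  if (position < tokens.length) throw 'Unexpected trailing input'
  return tree
}

console.log(JSON.stringify(parse('3 + 4 * x')))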
Errors that occur during parsing, called syntax errors, arise from malformed token sequences such as:
j = 4 * (6 - x;
i = /5
42 = x * 3
Programming language grammars are almost always context-free, so things like multiple declarations of a variable within a scope, referencing a variable before its (or without a) declaration, access violations, too many or too few arguments, or type checking failures are checked after grammar matching by the static analyzer. This produces the conceptual representation of the front end, which, for our running example, is:
Usually, an intermediate code generator transforms the graph representation of the program into something “flatter” like an abstract instruction list. For our running example, something like:
L1:   if x == 0 goto L2
      oldx := x
      t1 := y % x
      x := t1
      y := oldx
      goto L1
L2:   return y
Here t1, t2, ... are virtual registers. At this stage of compilation, we assume an infinite number of these.
Also at this stage, we can look for and apply various optimizations such as constant folding and strength reduction.
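Here is a sketch of what constant folding might look like over a made-up three-address instruction form (the { target, op, left, right } shape is invented just for this illustration):

// Constant folding: when both operands of an instruction are numeric
// literals, do the arithmetic now, at compile time.
const apply = { '+': (a, b) => a + b, '-': (a, b) => a - b,
                '*': (a, b) => a * b, '/': (a, b) => a / b }

function foldConstants(instructions) {
  return instructions.map(instruction => {
    const { target, op, left, right } = instruction
    if (typeof left === 'number' && typeof right === 'number' && apply[op])
      return { target, constant: apply[op](left, right) }
    return instruction
  })
}

// t1 := 4 * 16 becomes t1 := 64; t2 := x + 0 is left alone here (removing
// the useless + 0 is a job for algebraic simplification).
console.log(foldConstants([
  { target: 't1', op: '*', left: 4, right: 16 },
  { target: 't2', op: '+', left: 'x', right: 0 },
]))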
The intermediate code is often broken up into chunks known as basic blocks and assembled into a flow graph. Here’s our example:
We use the flow graph for control flow and data flow analysis, which helps with fancier optimizations like copy propagation, dead code elimination, induction variable simplification, loop fission and fusion, and more.
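To make basic blocks concrete, here is a sketch that chops a flat instruction list (written in the label-and-goto style of the intermediate code above) into blocks, using the usual rule that a block begins at a label and ends right after a jump or return:

// Basic-block formation sketch: start a new block at each label and
// after each goto or return. Instructions are plain strings here.
function toBasicBlocks(instructions) {
  const blocks = [[]]
  for (const instruction of instructions) {
    const startsWithLabel = /^\w+:/.test(instruction)
    if (startsWithLabel && blocks[blocks.length - 1].length > 0) blocks.push([])
    blocks[blocks.length - 1].push(instruction)
    if (/\b(goto|return)\b/.test(instruction)) blocks.push([])
  }
  return blocks.filter(block => block.length > 0)
}

// Our gcd intermediate code falls into three blocks.
console.log(toBasicBlocks([
  'L1: if x == 0 goto L2', 'oldx := x', 't1 := y % x', 'x := t1',
  'y := oldx', 'goto L1', 'L2: return y',
]))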
Classic compilers target machines with limited registers and fancy instruction sets, which have all sorts of arcane conventions involving register usage, memory allocation to manage function calls, and so much more. Code generators need to have intimate knowledge of processor architecture, so they can generate good code. In practice, code generators can be tuned to produce raw code (that is as close as possible to the source code) or “optimized” code. Here is unoptimized and optimized code for our running example on the ARM processor:
gcd(unsigned int, unsigned int):
        sub     sp, sp, #32
        str     w0, [sp, 12]
        str     w1, [sp, 8]
.L3:
        ldr     w0, [sp, 12]
        cmp     w0, 0
        beq     .L2
        ldr     w0, [sp, 12]
        str     w0, [sp, 28]
        ldr     w0, [sp, 8]
        ldr     w1, [sp, 12]
        udiv    w2, w0, w1
        ldr     w1, [sp, 12]
        mul     w1, w2, w1
        sub     w0, w0, w1
        str     w0, [sp, 12]
        ldr     w0, [sp, 28]
        str     w0, [sp, 8]
        b       .L3
.L2:
        ldr     w0, [sp, 8]
        add     sp, sp, 32
        ret
gcd(unsigned int, unsigned int):       ; x in w0, y in w1
        cbz     w0, .L2                ; if x == 0 goto L2
.L3:
        udiv    w2, w1, w0             ; w2 := y / x
        msub    w2, w2, w0, w1         ; w2 := y mod x
        mov     w1, w0                 ; y := x
        mov     w0, w2                 ; x := w2
        cbnz    w2, .L3                ; if x != 0 goto top
.L2:
        mov     w0, w1                 ; return y
        ret
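Real code generators are far more involved than anything shown here, but just for flavor, here is a toy generator that walks an expression AST (shaped like the one built by the parser sketch above) and emits a program for the little stack language from the top of these notes:

// A toy code generator: postorder-walk the tree, emitting operands before
// their operator, which is exactly postfix (stack machine) order.
function generate(node) {
  if (node.number !== undefined) return String(node.number)
  if (node.op !== undefined)
    return `${generate(node.left)} ${generate(node.right)} ${node.op}`
  throw 'Cannot generate code for this node'
}

// (3 + 4) * 5 compiles to "3 4 + 5 * P", which the interpreter prints as 35.
const ast = { op: '*',
              left: { op: '+', left: { number: 3 }, right: { number: 4 } },
              right: { number: 5 } }
console.log(generate(ast) + ' P')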
Now how many of these phases do people really write? Certainly not all of them, right? Aren’t there tools that can help?
Unless you are a student or hobbyist looking for a deep understanding of compiler technology, you don’t need to duplicate all the hard work that people have already put into middle and back ends. Most people these days make their front end target the LLVM system, which supplies middle-end and back-end functionality for a large number of architectures. LLVM is so popular these days that Andi McClure created a helpful diagram for y'all:
You don’t have to target LLVM of course. You can target other systems like JVM (the Java Virtual Machine) if you want to interoperate with the massive Java ecosystem, or CIL, an intermediate language on the .NET platform. Heck, you can even simply target C or JavaScript and take advantage of the amazing existing JavaScript engines and existing C compilers others have created already.
You might think just targeting C is the best approach. But why is LLVM so popular? LLVM is more than just a set of compiler phases; it’s a huge library of reusable components that allow all kinds of language processing. In fact, it’s generally preferable to use LLVM when implementing a language rather than simply translating to C:
... translate the input source to C code (or some other language) and send it through existing C compilers. This allows reuse of the optimizer and code generator, gives good flexibility, control over the runtime, and is really easy for front-end implementers to understand, implement, and maintain. Unfortunately, doing this prevents efficient implementation of exception handling, provides a poor debugging experience, slows down compilation, and can be problematic for languages that require guaranteed tail calls (or other features not supported by C). — Chris Lattner, from the LLVM chapter in The Architecture of Open Source Applications
There are three languages involved in translation: the source, the target, and the host. We show them in T-shapes, with source on the left, target on the right, and host below. Examples:
Here are three kinds of compilers you should learn to identify:
You’ll want to study existing compilers to get a feel for what’s involved. There are texts and tutorials about writing little languages, and there are real compilers for industrial-strength languages. Interested in the latter? Take a look at these:
gcc -S and view the generated assembly language in a .s file.
julia --output-bc hello.jl to see the LLVM.
swiftc -emit-sil hello.swift for Swift Intermediate Language
swiftc -emit-ir hello.swift for LLVM
swiftc -emit-assembly hello.swift for the assembly language
go tool compile -S hello.go shows you the assembly language.
javap -c Hello shows you JVM code in the class file Hello.class.
And don’t miss the AWESOME Compiler Explorer, which does live translations from C++, Rust, Go, and D to x86 assembly and machine code.
Writing a traditional compiler will enter you into the elite club of compiler writers and make you feel accomplished. But remember: these days, compiler writing is more about language processing than building up that classic pipeline of phases, as explained in this video:
Please do write a compiler! Writing a compiler will:
In other words, compiler writing is almost the perfect practical activity to bring the most fundamental theories of our discipline to life.
Convinced yet?
If you’re still a little nervous, here are some nice reads that might get you excited to write compilers because (1) they are cool, and (2) they might not be as hard to write as everyone thinks they are:
In addition, please browse this nice list of resources and select a few items from it for self-study.
Here’s a tutorial.
Good question. You are probably worried about lectures that look like this (found here):
Or maybe you are worried about following advice like these answers on StackExchange.
Don’t worry. It doesn’t have to be this way. You will learn by doing. With your friends. Here’s a testimonial from a successful alum:
Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.
We’ve covered: