Introduction to Compilers

You know you want to write a compiler. Let’s get oriented first. A short introduction will make your upcoming tasks easier and more enjoyable.

Programs, Interpreters and Translators

A program is a linguistic representation of a computation:

program.png

An interpreter is a program whose input is a program $P$ and some input data $x$; it executes $P$ on $x$:

interpreter.png

That is, $I(P,x) = P(x)$.

Interpreters can be implemented in hardware or software. For example, a CPU is a (hardware) interpreter for machine language programs, and the following is a (software) interpreter for a little language:

stack-interpreter.js
function interpret(program) {
  // The machine state is just a stack of numbers
  const stack = []
  function push(x) {
    stack.push(Number(x))
  }
  function pop() {
    if (stack.length === 0) throw new Error('Not enough operands')
    return stack.pop()
  }
  // Execute each whitespace-separated token of the program in turn
  for (const token of program.trim().split(/\s+/)) {
    let x, y
    if (/^\d+(\.\d+)?$/.test(token)) push(token)
    else if (token === 'N') (x = pop()), push(-x)
    else if (token === '+') (y = pop()), (x = pop()), push(x + y)
    else if (token === '-') (y = pop()), (x = pop()), push(x - y)
    else if (token === '*') (y = pop()), (x = pop()), push(x * y)
    else if (token === '/') (y = pop()), (x = pop()), push(x / y)
    else if (token === 'P') (x = pop()), console.log(x)
    else throw new Error('Illegal Instruction')
  }
}
Exercise: Can you figure out what language is being interpreted here?
Exercise: Interpret the following programs:
  • 8 P
  • 2 3 + N P
  • 3 1 + P 0 15 * 100 25 2 * / P P
(Since the interpreter is a JavaScript program, you can just copy-paste it into your browser’s console and invoke the function on any “program” of your choice, for example interpret("8 P").) Can you figure out what N and P stand for?
The Universal Interpreter 🤯🤯🤯

The fact that there exists a single interpreter capable of executing all possible programs (those representable in a formal system) is a very deep and significant discovery of Turing’s. It was, and perhaps still is, utterly mind-blowing.

It’s too expensive (not to mention ridiculously difficult) to directly interpret programs in high-level languages, so we generally translate them into something more easily interpretable. A translator is a program which takes as input a program in some language $S$ and outputs a program in a language $T$ such that the two programs have the same semantics.

translator.png

That is, $\forall x.\,P(x) = Q(x)$.
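To make this concrete, here is a sketch of a translator for the little stack language interpreted above, with JavaScript as the target language. (The function name translate and the exact output format are made up for this illustration; it is a sketch, not a definitive implementation.)

function translate(program) {
  const lines = ['const stack = [];']
  for (const token of program.trim().split(/\s+/)) {
    if (/^\d+(\.\d+)?$/.test(token)) lines.push(`stack.push(${token});`)
    else if (token === 'N') lines.push('stack.push(-stack.pop());')
    else if (token === 'P') lines.push('console.log(stack.pop());')
    else if (['+', '-', '*', '/'].includes(token))
      lines.push(`{ const y = stack.pop(), x = stack.pop(); stack.push(x ${token} y); }`)
    else throw new Error('Illegal Instruction')
  }
  return lines.join('\n')
}

Evaluating the JavaScript produced by translate("2 3 + P") prints 5, exactly as interpret("2 3 + P") does: the source and target programs have the same semantics.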

Some translators have really strange names:

Assembler
A translator from assembly language programs to machine language programs.
Compiler
A translator from “high-level” language programs into programs in a lower-level language.
Transpiler
A translator from high-level language programs into programs in a different high-level language.

If we translate the entirety of a program into something interpretable, we call this ahead-of-time (AOT) compilation. AOT compilers may be chained together, with the last in the chain often being an existing assembler, like this:

staticcompilation.png

When we compile portions of a program while other previously compiled portions are running, we have just-in-time (JIT) compilation. JIT compilers will often remember what they already compiled instead of compiling the same source over and over again. They can even do adaptive compilation and re-compilation based on runtime behavior.
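Here is a tiny sketch of that caching idea, reusing the illustrative translate function sketched above. (Real JITs are far more sophisticated; this shows only the memoization principle.)

const compiledCache = new Map()

function run(program) {
  if (!compiledCache.has(program)) {
    // First time we see this source: compile it into a JavaScript function
    compiledCache.set(program, new Function(translate(program)))
  }
  compiledCache.get(program)() // Later runs reuse the already-compiled code
}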

It gets even more interesting

It actually isn’t true that compilation, whether AOT or JIT, happens completely outside of your program. Many languages allow you to execute code at compile time and compile new code at run time: think of macros in Julia or Clojure, Perl’s BEGIN blocks, or any language with the dreaded eval.
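For example, JavaScript’s eval lets a program build new code as a string and compile-and-run it on the spot:

const expression = '2 * 3 + 1' // imagine this string was constructed at run time
console.log(eval(expression))  // compiles and executes the string, printing 7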

Translation Phases

Translation consists of analysis and generation phases.

analsyn.png
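To make the split concrete, here is a sketch that rebuilds the translate function from earlier as two halves. (The function names and the instruction-list representation are made up for this illustration.) The analysis half turns source text into an intermediate representation; the generation half turns that representation into target code:

// Analysis half: source text -> intermediate representation (instruction list)
function analyze(program) {
  return program.trim().split(/\s+/).map(token => {
    if (/^\d+(\.\d+)?$/.test(token)) return { op: 'PUSH', value: Number(token) }
    if (['N', '+', '-', '*', '/', 'P'].includes(token)) return { op: token }
    throw new Error('Illegal Instruction')
  })
}

// Generation half: intermediate representation -> target code (JavaScript)
function generate(instructions) {
  const lines = ['const stack = [];']
  for (const { op, value } of instructions) {
    if (op === 'PUSH') lines.push(`stack.push(${value});`)
    else if (op === 'N') lines.push('stack.push(-stack.pop());')
    else if (op === 'P') lines.push('console.log(stack.pop());')
    else lines.push(`{ const y = stack.pop(), x = stack.pop(); stack.push(x ${op} y); }`)
  }
  return lines.join('\n')
}

// The one-pass translator is just the composition of the two halves:
//   program => generate(analyze(program))
// To retarget, swap out generate; analyze stays exactly the same.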

There are two reasons for this split:

  1. We can’t really do it any other way: word-for-word translation just doesn’t work.
    Exercise: Give the most outrageous example you can think of where word-for-word translation between two natural languages comes out “so wrong.”
  2. It makes good engineering sense: if we need to write translators for $M$ source languages and $N$ target languages, we need only write $M + N$ simple programs (half-compilers) rather than $M \times N$ complex programs (full-compilers). For example, 5 source languages and 4 target languages call for only 9 half-compilers instead of 20 full compilers. (And as the sketch above suggests, retargeting means swapping out only the generation half.)

    frontendbackend.png

In practice it is rare for a conceptual representation to exist that is expressive enough and powerful enough to capture every imaginable source language and every imaginable target. Some systems do manage, at least, to get a lot of the common ones.

Natural Language Translation

How successful is machine translation for natural languages? Might it work best when the two languages are close cousins or siblings in an evolutionary language tree? Whether or not that’s the case, it’s often illuminating to play translation telephone. Here is one run I made using Google Translate:

  1. (English) She's feeling under the weather so will work from home today, with a laptop and cup of tea
  2. (Afrikaans) Sy gevoel onder die weer so sal werk van die huis af vandag, met 'n skootrekenaar en koppie tee
  3. (Hindi) वह मौसम के अंतर्गत महसूस किया तो एक लैपटॉप और एक कप चाय के साथ आज घर से काम करेंगे
  4. (Bulgarian) Той не се чувства добре с лаптоп и чаша чай ще работят от дома си днес
  5. (Filipino) Hindi niya pakiramdam komportable sa isang laptop at isang tasa ng tsaa ay trabaho mula sa bahay sa araw na ito
  6. (Sundanese) Anjeunna teu ngarasa nyaman jeung laptop jeung cangkir tea gawe ti imah ayeuna
  7. (Kazakh) Ол бар ноутбук жайлы сезінетін жоқ, бір кесе шай бүгін үйден
  8. (Chinese) 他不舒服的感覺與筆記本電腦,從家裡杯茶今天的工作
  9. (Telugu) ల్యాప్టాప్ తన అసౌకర్యం, నేడు ఇంటి నుండి పని, టీ ఒక కప్పు
  10. (English) His discomfort with a laptop, working from home today, a cup of tea

With idioms, sayings, and slang, you only need two languages to get something funny (the following really happened on Google Translate, but has since been fixed):

  1. (English) later alligator
  2. (Arabic) تمساح بيلحقك اركضي ولي هبلة
  3. (English) A crocodile hit you with my knees

“Screenshot or it didn’t happen”:

lateralligator1.png

lateralligator2.png

Exercise: Run some translation telephone iterations of your own. Use English language idioms. Share your own funny results online.

Machine translation for natural languages is getting much better. Modern systems use statistical methods, machine learning, dictionaries, and many other interesting techniques. We’re looking for great translations rather than perfect ones, because perfection is not always possible.

Fun read: Enjoy this article from December 2016 about issues in natural language translation, and about when computers started to get really good at it.

Exercise: Do some research into some major challenges still faced in natural language translation. One that comes to mind is gendered pronouns (as illustrated in the above telephone example). What kind of strategies can be employed in such cases?

Programming Language Translation

Programming languages tend to be free of the ambiguities and nuances of natural languages, so translations can, in most cases, be 100% precise and even formally verified. How do these systems look in practice? Decades of research into translation techniques have yielded some useful knowledge.

Compilers for programming languages go through a lot of phases; some have several dozen. In the so-called “classic compiler” (a standalone compiler that does everything from scratch), you tend to see the following major phases:

compiler-phases.png

Wait What? The MIDDLE End?

That’s right. There’s a middle end. Anyone designing their own programming language writes the front end to analyze their language into an intermediate representation. This may not be the “best” possible representation, so the middle end makes several passes over it to improve the code, by making it faster, more space-efficient, or easier for the back end to process. So our original picture is better drawn like this:

threephases.png
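Here is a sketch of one such middle-end pass, constant folding, operating on the instruction-list representation from the earlier analyze sketch. (Illustrative only; production middle ends are far more elaborate.)

// Whenever two pushed constants are immediately combined by an arithmetic
// instruction, do the arithmetic at compile time instead of at run time
function foldConstants(instructions) {
  const output = []
  for (const instruction of instructions) {
    const [a, b] = output.slice(-2)
    if (['+', '-', '*', '/'].includes(instruction.op) && output.length >= 2 &&
        a.op === 'PUSH' && b.op === 'PUSH') {
      const folded = { '+': a.value + b.value, '-': a.value - b.value,
                       '*': a.value * b.value, '/': a.value / b.value }[instruction.op]
      output.splice(-2, 2, { op: 'PUSH', value: folded })
    } else {
      output.push(instruction)
    }
  }
  return output
}

// foldConstants(analyze("2 3 + P")) yields just a PUSH of 5 followed by P,
// so the generated target code need not compute 2 + 3 at run time.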

In practice, unless you are a student or hobbyist looking for a deep understanding of compiler technology, you don’t duplicate all the hard work that people have already put into middle and back ends. Most people these days make their front end target the LLVM system, which supplies middle-end and back-end functionality for a large number of architectures. LLVM is so popular that Andi McClure created a helpful diagram for y'all:

llvm-humor.png

You don’t have to target LLVM, of course. You can target other systems like the JVM (the Java Virtual Machine) if you want to interoperate with the massive Java ecosystem, or CIL, an intermediate language on the .NET platform. Heck, you can even simply target C or JavaScript and take advantage of the amazing C compilers and JavaScript engines that others have already created.

frontends.png

You might think just targeting C is best. But why, then, is LLVM so popular? LLVM is more than just a set of compiler phases; it’s a huge library of reusable components that allow all kinds of language processing. In fact, it’s generally preferable to use LLVM when implementing a language rather than simply translating to C:

... translate the input source to C code (or some other language) and send it through existing C compilers. This allows reuse of the optimizer and code generator, gives good flexibility, control over the runtime, and is really easy for front-end implementers to understand, implement, and maintain. Unfortunately, doing this prevents efficient implementation of exception handling, provides a poor debugging experience, slows down compilation, and can be problematic for languages that require guaranteed tail calls (or other features not supported by C). — Chris Lattner, from the LLVM chapter in The Architecture of Open Source Applications

Compiler Writing in Practice

There are three languages involved in translation: the source, the target, and the host. We show them in T-shapes, with source on the left, target on the right, and host below. Examples:

tdiagrams.png

Here are three kinds of compilers you should learn to identify:

Exercise: Work out (or research) and then explain how to use a self-compiler and a cross-compiler to port a compiler to an architecture that has no compiler.
Exercise: A compiler is called an optimizing compiler if it works really hard to produce the most efficient target code that is feasibly possible. What can you use an optimizing self-compiler for?

Let’s See Some Examples

You’ll want to study existing compilers to get a feel for what’s involved. There are texts and tutorials about writing little languages, and there are real compilers for industrial-strength languages. Interested in the latter? Take a look at these:

Exercise: Experiment with at least three of the above.

And don’t miss the AWESOME Compiler Explorer, which does live translations from C++, Rust, Go, and D to x86 assembly and machine code.

Exercise: Experiment with the Compiler Explorer.

But wait, it’s not just about the compiler!

Writing a traditional compiler will get you into the elite club of compiler writers and make you feel accomplished. But remember: these days, compiler writing is more about language processing than about building up that classic pipeline of phases, as explained in this video:

Why Should I Care?

Please do write a compiler! Writing a compiler will:

  1. Make you a better programmer in general, as you will better understand a language’s intricacies and obscurities. You will even start making sense of those ridiculous error messages from other people’s compilers.
  2. Help you write preprocessors, linkers, loaders, debuggers, profilers, syntax highlighters, text formatters, code coverage tools, database query languages, and IDEs, or your very own command processors, message scanners, etc., since you go through many of the same tasks you do in writing a compiler. Think you won’t ever write such things? A lot of people think that at first, but end up doing so.
  3. Improve your knowledge of many related technologies: generalized optimization tasks, advanced computer architectures, graphics interfaces, device controllers, virtual machines, and the translation of human languages.
  4. Make you a better computer scientist, because compiler technology spans so many areas of the discipline, including:
    • Formal Language Theory
      • Grammars
      • Parsing
      • Computability
    • Programming Languages
      • Syntax
      • Semantics (both static and dynamic)
      • Support for Paradigms (OOP, functional, logic, stack-based)
      • Details of parallelism, metaprogramming, etc.
      • Abstract languages and virtual machines
      • Concepts, such as: sequencing, conditional execution, iteration, recursion, functional decomposition, modularity, synchronization, metaprogramming, binding, scope, extent, volatile, const, subtype, pattern matching, type inference, closures, prototypes, introspection, instrumentation, annotations, decorators, memoization, traits, streams, monads, actors, mailboxes, comprehensions, continuations, wildcards, promises, regular expressions, proxies, transactional memory, inheritance, polymorphism, parameter modes, etc.
    • Computer Architecture
      • Instruction sets
      • RISC vs. CISC
      • Pipelines
      • Cores
      • Clock cycles, etc.
    • Algorithms and Data Structures
      • Regular expressions
      • Parsing algorithms
      • Graph algorithms
      • Union-find
      • Dynamic programming
      • Learning
    • Software Engineering (because compilers are generally large and complex)
      • Locality
      • Caching
      • Componentization, APIs, Reuse
      • Synchronization
  5. Give you practice designing and building large systems, as you grapple with choices and weigh tradeoffs such as:
    • Should I make a one-shot translator or make several layered independent “translation passes”?
    • Should it be quick-and-dirty or take its time generating amazing (optimized) code?
    • Should it stop at the first error? When should it stop? Should it...autocorrect?!
    • What parts should I build myself and when should I use tools?
    • Is retargetability important?
    • Should I just write the front-end only (say, targeting LLVM, RTL, or the JVM)? Or maybe just do source-to-source transpilation to C or JavaScript? Or go all the way to assembly, or machine code?
  6. Make you a better language designer. People might not use your language, no matter how cool, if it can’t be implemented efficiently. You’ll learn the impact of type systems, preprocessors, garbage collection and the like on both compile-time and run-time performance.
  7. Give you cred in the Programming Languages Community. A compiler course is really just a Programming Languages II course. In Programming Languages I you learn about specifying and using languages. In Programming Languages II you learn about designing and implementing languages. Implementation is interpretation and translation.
  8. Give you street cred in general. “Woah, you wrote a compiler?!”

Convinced yet?

Some Good Reads

If you’re still a little nervous, here are some nice reads that might get you excited to write compilers, because (1) compilers are cool, and (2) they might not be as hard to write as everyone thinks they are:

In addition, please browse this nice list of resources and select a few items from it for self-study.

Here’s a tutorial.

Should I Take A Compiler Class?

Good question. You are probably worried about lectures that look like this (found here):

bad-compiler-lecture.png

Or maybe you are worried about following advice like these answers on Stack Exchange.

Don’t worry. It doesn’t have to be this way. You will learn by doing. With your friends. Here’s a testimonial from a successful alum:

whycompiler.png

Summary

We’ve covered:

  • What a program is, what an interpreter is, and what a compiler is
  • The difference between AOT and JIT compilation
  • Why translation has analysis and generation phases
  • A simplified architecture of a classic compiler
  • Real compilers often use pre-existing solutions for their middle and back ends
  • Places you can go to study and experiment with real compilers
  • Many good reasons to learn about compilers and write your own