Introduction to Parsing Theory

Compilers need to parse. A parser is just code, so to build one, we could just wing it. But a theory helps us understand parsing at a deep level, enabling us to write better, faster, and provably correct parsers. Let’s learn!

What is Parsing Again?

Parsing is the process of uncovering the derivation, or derivation tree, of a string from a grammar. We want to get the parse tree because a lot of meaning is often encoded in the structure of the string.

For example, given the grammar:

$\begin{array}{lcl} \mathit{pal} & \longrightarrow & \varepsilon \mid \texttt{"a"} \mid \texttt{"b"} \mid \texttt{"a"}\;\mathit{pal}\;\texttt{"a"} \mid \texttt{"b"}\;\mathit{pal}\;\texttt{"b"} \\ \end{array}$

we parse the string $abbba$ into the derivation tree:

            pal
          /  |  \
       "a"  pal  "a"
          /  |  \
       "b"  pal  "b"
             |
            "b"

In a real programming language, we generally don’t make a derivation tree for each character in the source program. Instead, we first group characters into tokens and we consider the tokens as the symbols.

        Program
           |
       Statement
       /   |   \
   print identifier ;

The pre-processing of the source program via tokenization is so common and so natural and so important that there is a kind of separate (simpler) theory and practice that deals with this, called lexical analysis. Let’s briefly talk about that first before getting into the richer theory of parsing.

Lexical Analysis

Tokenization, or lexical analysis, is the process of forming tokens out of the source code character stream. You basically look at the character stream character-by-character, skipping whitespace and comments, and collecting characters into tokens. Sometimes you might need to “look ahead” a little, or “back up” as you go through the character stream.

Generally, each token will have a type, a lexeme, and line and column position. Types of tokens include:

  • keywords (like print or while)
  • identifiers (like start or x)
  • literals (numbers like 420.0, strings like "It is")
  • operators and punctuation symbols (like +, <=, (, and ;)

So, for the source program

  print( "It is" // ⿇🌿
420.0 + start  );

we would expect a token stream like this:

{category: "kwd", lexeme: "print", line: 1, column: 3}
{category: "sym", lexeme: "(", line: 1, column: 8}
{category: "str", lexeme: "\"It is\"", line: 1, column: 10}
{category: "num", lexeme: "420.0", line: 2, column: 1}
{category: "sym", lexeme: "+", line: 2, column: 7}
{category: "ide", lexeme: "start", line: 2, column: 9}
{category: "sym", lexeme: ")", line: 2, column: 16}
{category: "sym", lexeme: ";", line: 2, column: 17}

Many tokens are formed from a single character. Those that aren’t need precise rules for their specification. Negative lookahead is common in these rules! Here are some examples:

$\begin{array}{lcl} \mathit{num} & \longleftarrow & \mathit{digit}^+ \; (\texttt{"."} \; \mathit{digit}^+)? \; ((\texttt{"E"} \,|\, \texttt{"e"}) \; (\texttt{"+"} \,|\, \texttt{"-"})? \;\mathit{digit}^+)? \\ \mathit{id} & \longleftarrow & \neg\mathit{print}\;\mathit{letter}\;\mathit{idchar}^* \\ \mathit{idchar} & \longleftarrow & \mathit{letter}\,|\,\mathit{digit}\,|\,\texttt{"_"} \\ \mathit{print} & \longleftarrow & \texttt{"print"}\;\neg\mathit{idchar} \end{array}$

Tokens are separated by spaces and comments, so it is conventional to define them with the lexical rules:

$\mathit{space} \longleftarrow \texttt{" "}\,|\,\texttt{"\t"}\,|\,\texttt{"\r"}\,|\,\texttt{"\n"}\,|\,\texttt{"//"}\;(\neg\texttt{"\n"} \; \mathit{any})^*$

Tokenization is associated with Regular Languages

Traditionally, throughout the first 50+ years of computer science being computer science, people often described the workings of a lexer with Finite Automata. Building an FA for a lexical specification is not something you’d really ever do in practice. It’s just fun to know that you can do it...or at least almost do it, since lookahead and negative lookahead and backtracking really complicate things. It’s something you are very likely to find in a compiler book!
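Just for flavor, here is a tiny hand-rolled DFA, sketched in JavaScript, that recognizes numeric literals shaped like the num rule in these notes (digits, an optional fraction, an optional exponent). The state names are made up for illustration; this is a sketch of the idea, not production lexer code:

```javascript
// States of a DFA for numeric literals such as 420.0 or 1e-9.
// Accepting states: a bare integer, a fraction, or exponent digits.
const ACCEPTING = new Set(["int", "frac", "expDigits"])

// The transition function: given a state and a character, return the
// next state, or null if the DFA rejects.
function nextState(state, ch) {
  const digit = /\d/.test(ch)
  switch (state) {
    case "start":     return digit ? "int" : null
    case "int":       return digit ? "int" : ch === "." ? "dot" : /[Ee]/.test(ch) ? "exp" : null
    case "dot":       return digit ? "frac" : null
    case "frac":      return digit ? "frac" : /[Ee]/.test(ch) ? "exp" : null
    case "exp":       return digit ? "expDigits" : /[+-]/.test(ch) ? "expSign" : null
    case "expSign":   return digit ? "expDigits" : null
    case "expDigits": return digit ? "expDigits" : null
    default:          return null
  }
}

// Run the DFA over the whole string; accept only in an accepting state.
function isNum(s) {
  let state = "start"
  for (const ch of s) {
    state = nextState(state, ch)
    if (state === null) return false
  }
  return ACCEPTING.has(state)
}
```

Note how the rejection of a string like `42.` (a trailing dot) falls out naturally: the `dot` state is not accepting.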

Writing a lexer by hand is possible. Once you have a string of characters, you’d do something like this (for a very small programming language):

const KEYWORDS = ["print", "if", "else", "while", "for", "function", "return"]

// Input: an array of characters with a newline in the last slot
function* tokenizeLine(lineNumber, line) {
  for (let i = 0; i < line.length; ) {
    // Skip spaces
    while (/[ \t\r]/.test(line[i])) i++

    // Done if at end or starting a comment
    if (line[i] === "\n" || line[i] + line[i + 1] === "//") break

    // Gather up the lexeme from start..i
    let category
    let start = i++
    if (/\p{L}/u.test(line[start])) {
      while (/[\p{L}\d_]/u.test(line[i])) i++
      category = "Id"
      if (KEYWORDS.includes(line.slice(start, i).join(""))) {
        category = "Kwd"
      }
    } else if (/\d/.test(line[start])) {
      while (/\d/.test(line[i])) i++
      if (line[i] === ".") {
        i++
        if (!/\d/.test(line[i])) {
          error(`Digit expected`, { line: lineNumber, column: i + 1 })
        }
        while (/\d/.test(line[i])) i++
      }
      if (line[i] === "E" || line[i] === "e") {
        i++
        if (line[i] === "+" || line[i] === "-") i++
        if (!/\d/.test(line[i])) {
          error(`Digit expected`, { line: lineNumber, column: i + 1 })
        }
        while (/\d/.test(line[i])) i++
      }
      category = "Num"
    } else if (/[-+*/%=<>!,;()]/.test(line[start])) {
      if (line[start] + line[i] === "**") i++
      if (line[start] + line[i] === "<=") i++
      if (line[start] + line[i] === ">=") i++
      if (line[start] + line[i] === "!=") i++
      category = "Sym"
    } else {
      error(`Unexpected character: '${line[start]}'`, { line: lineNumber, column: start + 1 })
    }

    // Compensate for column beginning at 1, not 0
    yield new Token(category, line.slice(start, i).join(""), lineNumber, start + 1)
  }
}
Exercise: There are two things this mechanism does not do well: string interpolation and multi-line strings. Can you figure out how to modify the code to handle these cases?

In general, lexers are concerned with issues such as:

  • whether the language is context-sensitive at the lexical level
  • which words are reserved
  • the significance of blanks and newlines
  • length limits on identifiers
  • whether comments can nest
  • source-level macros, like those in C

Errors that might occur during scanning, called lexical errors, include:

  • encountering a character that cannot begin any token
  • malformed numeric literals (such as a missing digit after a decimal point)
  • unterminated strings or comments

Approaches to Parsing

Once you have a lexer written and deployed, you can now work entirely with token streams. Parsing now works entirely at the level of the phrase grammar. The parser’s job is to uncover the derivation tree, or parse tree, whose leaves are now tokens, not characters.

A Little Theory

A language’s syntax is normally defined with a CFG (or the equally powerful BNF, EBNF, ABNF, or Syntax Diagram), or a PEG. A compiler uses the grammar to parse the token sequence, that is, to determine if the token string can be generated by the grammar, and produce the syntax tree for the token sequence. People have been studying parsing for decades! Here are some things they figured out:

  • Some CFGs are ambiguous, and ambiguity is a problem: a string with two parse trees has no single structure on which to hang meaning.
  • Arbitrary CFGs can be parsed in cubic time, with algorithms such as CYK or Earley’s.
  • The deterministic CFLs, those with $LR(k)$ grammars, can be parsed in linear time.
  • The $LL(k)$ grammars describe a proper subset of the deterministic CFLs; they too can be parsed in linear time, with parsers simple enough to write by hand.

That’s some good background info to start a study of parsing. But wait, first, what do $LL$ and $LR$ mean?

$LL(k)$ Grammar
A grammar that can be used to drive a leftmost derivation by reading the input left-to-right, examining only the next $k$ symbols, and never backing up.
$LR(k)$ Grammar
A grammar that can be used to drive a rightmost derivation by reading the input left-to-right, examining only the next $k$ symbols, and never backing up.

Um, are there more practical descriptions of these language classes? Pretty much, yeah. Try these:

$LL(k)$
Locate all the “choice points” in the grammar (these are easy to find in the equivalent syntax diagram). Determine the maximum number of tokens, $k$, required to know exactly which path to take over all choice points. If $k$ is finite, the grammar is LL(k).
$LR(k)$
The $LR$ languages are exactly the languages recognizable by a DPDA, that is, exactly the DCFLs. An $LR(k)$ grammar is a CFG for a language that is a DCFL.

An $LL$ and $LR$ Fact Sheet:

LL Grammars:
  • Subset of CFGs
  • Never ambiguous
  • Parsable in Linear Time
  • Left-to-right, Leftmost derivation
  • Top-down
  • Predictive
  • Expand-Match
  • Parsers can be hand-made
  • Can give nice error messages
  • Can not handle left recursion
  • Less "powerful" than LR

LR Grammars:
  • Subset of CFGs
  • Never ambiguous
  • Parsable in Linear Time
  • Left-to-right, Rightmost derivation
  • Bottom-up
  • Not Predictive
  • Shift-Reduce
  • Nearly impossible to write parsers by hand
  • Must work hard to give good errors
  • Left recursion is desirable!
  • More "powerful" than LL

Wow! That was a lot. Details are coming. Read on.

Exercise: Why is it that a grammar with left-recursion can not be LL(k) for any k?
Exercise: For more on $LL$ and $LR$ see this BNF article with a great little section on parsing.
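To make “predictive,” “expand-match,” and “parsers can be hand-made” concrete, here is a minimal sketch (in JavaScript, with illustrative names) of a hand-written LL(1) recognizer for the toy grammar S → "(" S ")" S | ε, a grammar chosen for this example and not from the notes. One token of lookahead decides every choice point, and the parser never backs up:

```javascript
// Predictive (LL(1)) recognizer for: S -> "(" S ")" S | ε
// The "tokens" here are single characters for simplicity.
function parse(input) {
  let pos = 0
  const peek = () => input[pos]

  function S() {
    if (peek() === "(") {     // lookahead predicts S -> "(" S ")" S
      pos++                   // match "("
      S()
      if (peek() !== ")") throw new Error(`expected ")" at position ${pos}`)
      pos++                   // match ")"
      S()
    }                         // otherwise predict S -> ε and match nothing
  }

  S()
  return pos === input.length // accept only if all input was consumed
}
```

Note the error message: because the parser always knows which production it committed to, it can report exactly what it expected and where, which is why LL parsers “can give nice error messages.”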

We’ve made syntax diagrams to describe all the legal programs in the language, but how do we follow the diagrams to “match” or “parse” a candidate program? That is, given a program (a candidate string), how, exactly, do we figure out how it is matched by, or generated by, the syntax diagrams? How do we parse?

There have been hundreds of books and thousands of papers on the topic of parsing over the years. Here we just need to highlight a few things.

Hand-Crafted Parsers

Because a parser is just a program that turns source code into a parse tree, you can just write one by hand. The most common kind of hand-crafted parser uses the technique of recursive descent, which you can read about at Wikipedia. You kind of walk through each of the syntax diagrams.

CLASSWORK
I’ve written a recursive descent parser as part of an Astro compiler. We’ll take a very brief look at it in class.

Parser Generators

A recursive descent parser generally requires a highly restricted syntax, one that is in line with a class of languages known as the $LL(k)$ languages. These languages are a proper subset of the $LR(k)$ languages (also known as the deterministic context-free languages), which are used in popular parser generators such as YACC, Bison, and ANTLR, among others.

The approach here is to take a context free grammar and turn it into a table that directs the parser to examine the current state and the current symbol and take an action. The two main kinds of actions are expand-match (for $LL(k)$ parsers) and shift-reduce (for $LR(k)$ parsers).
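As a sketch of the table-driven idea on the expand-match ($LL$) side (this is illustrative, not what any particular generator emits), here is a generic driver plus a parse table for the hypothetical LL(1) grammar S → "(" S ")" S | ε. The table maps a (nonterminal, lookahead) pair to the production to expand:

```javascript
// LL(1) parse table for: S -> "(" S ")" S | ε
// "$" is an end-of-input marker; [] is the empty (ε) production.
const TABLE = {
  S: { "(": ["(", "S", ")", "S"], ")": [], "$": [] },
}

// Generic expand-match driver: a stack of grammar symbols, consumed
// against the token stream. Nonterminals are expanded via the table;
// terminals must match the current token exactly.
function expandMatch(input) {
  const tokens = [...input, "$"]
  const stack = ["$", "S"]              // start symbol on top
  let i = 0
  while (stack.length > 0) {
    const top = stack.pop()
    const look = tokens[i]
    if (top in TABLE) {                 // nonterminal: expand
      const production = TABLE[top][look]
      if (production === undefined) return false
      for (let k = production.length - 1; k >= 0; k--) stack.push(production[k])
    } else {                            // terminal: match
      if (top !== look) return false
      i++
    }
  }
  return i === tokens.length            // accepted if "$" was matched too
}
```

The driver never changes; a parser generator’s whole job, in this view, is computing the table from the grammar.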

While these systems are quite popular and famous, we won’t be covering them here. Do read about them, though.

Analytic Grammars

Another approach is to use a specialized kind of grammar designed specifically for parsing. So instead of using traditional generative grammars, covered previously in our notes on Language Theory (which specify a language by giving a set of rules to generate all and only those strings in a language), we can use an analytic grammar, in which the rules specify how to match a candidate string.

The biggest difference between generative and analytic grammars is in the handling of alternatives. In a generative grammar $G$, $L(G)$ contains all strings that can be generated by applying rules whenever possible. For example, the rule $S \rightarrow A\,|\,B$ means you can replace an $S$ with an $A$ or a $B$, so you can generate both. So the grammar:

$s \longrightarrow \texttt{"a"} \mid \texttt{"ab"}$

describes the language $\{ a, ab\}$.

In an analytic grammar, the rules are deterministic; you don’t get to pick one or the other. So $A\,/\,B$ means: “Try to match $A$. If you can, the rule matches and you are done. Otherwise (and only otherwise), try to match $B\,$”. So:

$s \longleftarrow \texttt{"a"}\;/\;\texttt{"ab"}$

recognizes the language $\{ a \}$, but:

$s \longleftarrow \texttt{"ab"}\;/\;\texttt{"a"}$

recognizes the language $\{ a, ab \}$.
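A sketch of ordered choice in JavaScript (illustrative names, not a real PEG library) makes the difference between these two rules observable. Each parser is a function from a position to a new position, or null on failure:

```javascript
// A literal-string parser: succeed by advancing past s, or fail with null.
const lit = s => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : null

// Ordered choice: commit to the FIRST alternative that succeeds.
const choice = (...alts) => (input, pos) => {
  for (const alt of alts) {
    const next = alt(input, pos)
    if (next !== null) return next
  }
  return null
}

// A rule recognizes a string only if it consumes all of it.
const recognizes = (rule, input) => rule(input, 0) === input.length

const s1 = choice(lit("a"), lit("ab"))  // s <- "a" / "ab"
const s2 = choice(lit("ab"), lit("a"))  // s <- "ab" / "a"
```

With s1, the alternative "a" matches first and the choice commits, so the input "ab" is never fully consumed; with s2, both "a" and "ab" are recognized, matching the languages claimed above.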

Note the tradeoff here. Determinism means there’s never ambiguity (a string can only parse one way, if it can at all), which is great! But the ordering of the rules matters, so you have to be careful to do what you want. Here’s another illustration. If we wanted an analytic grammar for Astro, the rule:

$\mathit{Primary} \longleftarrow \mathit{num}\;/\;\mathit{id}\;/\;\mathit{Call}\;/\;\ldots$

would be wrong, as it will never match a call!

Exercise: Why, exactly, must the Call alternative precede the id alternative?
CLASSWORK
This might be deep, so we’ll try a few more examples.

Unlike generative grammars, analytic grammars feature special lookahead operators, including the negative lookahead operator (~) and the positive lookahead operator (&). This enables the definition of non-context-free languages, despite having only one symbol on the left-hand side! Here’s an analytic grammar for the non-context-free language $\{ a^nb^nc^n \,\mid\, n \geq 1 \}$:

$ \begin{array}{lcl} \mathit{s} &\longleftarrow& \texttt{&} \; (\mathit{x}\texttt{ "c"}) \; \texttt{"a"}^+\;\mathit{y} \\ \mathit{x} &\longleftarrow& \texttt{"a"} \; \mathit{x}? \; \texttt{"b"} \\ \mathit{y} &\longleftarrow& \texttt{"b"} \; \mathit{y}? \; \texttt{"c"} \end{array} $
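Transcribing this grammar into code (a sketch in JavaScript; the helper names are made up) shows how the positive lookahead checks that the a’s and b’s balance without consuming anything:

```javascript
// Recognizer for { a^n b^n c^n | n >= 1 }, following the analytic grammar:
//   s <- &(x "c") "a"+ y     x <- "a" x? "b"     y <- "b" y? "c"
// Each rule maps a position to the position after its match, or null.
function matchAnBnCn(input) {
  const x = pos => {                       // x <- "a" x? "b"
    if (input[pos] !== "a") return null
    pos++
    const inner = x(pos)
    if (inner !== null) pos = inner        // x? : optional, commits if it matches
    return input[pos] === "b" ? pos + 1 : null
  }
  const y = pos => {                       // y <- "b" y? "c"
    if (input[pos] !== "b") return null
    pos++
    const inner = y(pos)
    if (inner !== null) pos = inner
    return input[pos] === "c" ? pos + 1 : null
  }
  // s <- &(x "c") "a"+ y
  let pos = 0
  const ahead = x(pos)                     // lookahead: a^n b^n must be followed by "c"...
  if (ahead === null || input[ahead] !== "c") return false
  if (input[pos] !== "a") return false     // ...but we resume from pos 0: nothing consumed
  while (input[pos] === "a") pos++         // "a"+
  const end = y(pos)                       // then y checks the b's balance the c's
  return end !== null && end === input.length
}
```

The lookahead &(x "c") verifies $i = j$ in $a^ib^jc^k$, and the rule y verifies $j = k$; together they pin down all three counts.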

Parsing Expression Grammars (PEGs)

There are various formalisms of analytic grammars (Wikipedia identifies several); the most popular is the Parsing Expression Grammar (PEG). To read a PEG description, pronounce the operators as follows:

  • $e_1\;e_2$: “$e_1$ followed by $e_2$”
  • $e_1\,/\,e_2$: “$e_1$, or else $e_2$”
  • $e?$: “an optional $e$”
  • $e^*$: “zero or more $e$’s”
  • $e^+$: “one or more $e$’s”
  • $\&e$: “make sure an $e$ follows here” (positive lookahead; consumes nothing)
  • $\neg e$: “make sure an $e$ does not follow here” (negative lookahead; consumes nothing)

Note how PEGs use / instead of |, perhaps to emphasize the deterministic nature of matching.

Strictly speaking, it should be impossible in a PEG to have left-recursive rules like:

$\mathit{Exp} \; \longleftarrow \mathit{Exp} \; \texttt{"+"} \; \mathit{Term} \;\;/\;\; \mathit{Term}$

given that “in order to match an Exp, you must first try to match an Exp.” It sounds circular. In fact it is, and traditionally, PEG implementations suggested using the rule:

$\mathit{Exp} \; \longleftarrow \mathit{Term} \; (\texttt{"+"} \; \mathit{Term})^*$

instead. However, it turns out a proper parsing technique (using a packrat parser) can deal with it! It's not easy, but you can read the original paper by Alex Warth et al or this nice and detailed article by Guido van Rossum to see how it’s done.
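Here is a sketch in JavaScript of the iterative rewrite inside a recursive descent parser. A common worry is that replacing left recursion with a loop loses left associativity, but folding the tree as the loop runs preserves it. (Tokens here are assumed to be a pre-lexed array like [1, "+", 2]; a real parser would take a token stream.)

```javascript
// Exp <- Term ("+" | "-") Term)* rendered as a loop, building a
// LEFT-associative tree: each iteration folds the tree-so-far into
// the left child of a new node.
function parseExp(tokens) {
  let i = 0
  const term = () => tokens[i++]          // a Term is just a number in this sketch
  let tree = term()
  while (tokens[i] === "+" || tokens[i] === "-") {
    const op = tokens[i++]
    const right = term()
    tree = { op, left: tree, right }      // fold to the left as we iterate
  }
  return tree
}
```

For example, parseExp([1, "-", 2, "-", 3]) builds the tree for (1 - 2) - 3, not 1 - (2 - 3), which is exactly what the left-recursive rule would have produced.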

Check your understanding

Analytic grammars:

  • Are designed for parsing (recognition) rather than generation of languages.
  • Are written with only a single variable on the left hand side of each rule.
  • Use $\longleftarrow$ rather than $\longrightarrow$ to define rules.
  • Execute alternatives left-to-right and once you get a match, you keep it, even if it leads to a dead end.
  • Have powerful lookahead operators, thanks to deterministic execution.
  • Can handle left-recursive rules provided special techniques are employed.

Try this exercise before moving on:

Exercise: Show that the following generative grammar for assignment statements and if-then-else statements in a programming language:
    S ⟶ "a" | "i" "x" "t" S ("e" S)?
is ambiguous, but that the ambiguity can be resolved by rewriting as a PEG:
    S ⟵ "a" / "i" "x" "t" S "e" S / "i" "x" "t" S
How, exactly, is the ambiguity resolved? (Historical note: this is the dangling else problem, a classic in parsing theory.)

Now let’s put everything together to come up with a Parsing Expression Grammar for Astro, that (1) properly separates lexical and phrase rules, (2) deals properly with precedence and associativity, (3) embraces left-recursion, and (4) correctly handles reserved words.

$ \begin{array}{lcl} \mathit{Program} &\longleftarrow& \mathit{Statement}^+ \\ \mathit{Statement} &\longleftarrow& \mathit{Assignment} \;/\; \mathit{PrintStmt} \\ \mathit{Assignment} &\longleftarrow& \mathit{id}\;\texttt{"="}\;\mathit{Exp}\;\texttt{";"} \\ \mathit{PrintStmt} &\longleftarrow& \mathit{print}\;\mathit{Exp}\;\texttt{";"} \\ \mathit{Exp} &\longleftarrow& \mathit{Exp} \;(\texttt{"+"}\,|\,\texttt{"-"})\; \mathit{Term} \;/\; \mathit{Term} \\ \mathit{Term} &\longleftarrow& \mathit{Term} \;(\texttt{"*"}\,|\,\texttt{"/"}\,|\,\texttt{"%"})\; \mathit{Factor} \;/\; \mathit{Factor} \\ \mathit{Factor} &\longleftarrow& \mathit{Primary}\;(\texttt{"**"}\;\mathit{Factor})? \;/\; \texttt{"-"} \; \mathit{Primary} \\ \mathit{Primary} &\longleftarrow& \mathit{num} \;/\; \mathit{Call} \;/\; \mathit{id} \;/\; \texttt{"("} \; \mathit{Exp} \; \texttt{")"} \\ \mathit{Call} &\longleftarrow& \mathit{id}\;\texttt{"("}\;(\mathit{Exp}\;(\texttt{","}\;\mathit{Exp})^*)?\;\texttt{")"} \\ \mathit{num} & \longleftarrow & \mathit{digit}^+ \; (\texttt{"."} \; \mathit{digit}^+)? \; ((\texttt{"E"} | \texttt{"e"}) \; (\texttt{"+"} | \texttt{"-"})? \;\mathit{digit}^+)? \\ \mathit{print} &\longleftarrow& \texttt{"print"}\;\neg\mathit{idchar}\\ \mathit{id} &\longleftarrow& \neg\mathit{print}\;\mathit{letter}\;\mathit{idchar}^*\\ \mathit{idchar} &\longleftarrow& \mathit{letter}\,|\,\mathit{digit}\,|\,\texttt{"_"}\\ \mathit{space}&\longleftarrow& \texttt{" "}\,|\,\texttt{"\t"}\,|\,\texttt{"\r"}\,|\,\texttt{"\n"}\,|\,\texttt{"//"}\;(\neg\texttt{"\n"} \; \mathit{any})^* \end{array} $

For more about PEGs, see the Wikipedia article on parsing expression grammars and Bryan Ford’s original paper introducing the formalism.

A popular library that builds parsers using PEGs is Ohm.

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

  1. What does a parser do?
    A parser uncovers the derivation tree of a string, according to the rules of a formal grammar (assuming the string is well-formed).
  2. Derivation trees for real programming languages do not go all the way down to individual characters. Where do they stop?
    They stop at tokens.
  3. What is the process of computing the tokens?
    Lexical analysis, lexing, or tokenization.
  4. What kind of things are tokens?
    Keywords, identifiers, literals, punctuation symbols, operators.
  5. In traditional compiler theory, tokenization is often described by what theoretical device?
    Finite Automata (FA)
  6. Why are Finite Automata not a great fit for describing real lexers?
    Because of lookahead, negative lookahead, and backtracking.
  7. What kind of language design issues matter during lexical analysis?
    Issues like context-sensitivity, which words are reserved, the significance of blanks, the significance of newlines, length limits on words, nestability of comments, source-level macros like you see in C, etc.
  8. Why are grammars for programming languages often context-free?
    The context-sensitive aspects of a language definition are ridiculously complex to the point of making it a terrible idea to use a context-sensitive grammar.
  9. Parsing an arbitrary context-free grammar takes cubic time in general. What subset of CFGs can be parsed in linear time?
    DCFGs (LR(k) grammars) and a subset of them, LL(k) grammars.
  10. What mechanisms underlie LL(k) and LR(k) parsing?
    LL(k) uses expand-match, LR(k) uses shift-reduce.
  11. Which of LL or LR languages are top-down?
    LL.
  12. Which of LL or LR languages give better error messages?
    LL.
  13. Which of LL or LR languages can be made by hand?
    LL.
  14. Which of LL or LR languages are a larger family?
    LR.
  15. What is the most common type of hand-crafted parser?
    Recursive descent parser.
  16. What are some advantages of hand-crafted parsers over parsers generated by a machine?
    Hand-crafted parsers can be faster and give much more precise error messages.
  17. What are some applications and libraries that can generate parsers?
    YACC, Bison, ANTLR, Ohm
  18. What kind of grammar is Ohm based on?
    Parsing Expression Grammar (PEG)
  19. What is the difference between generative and analytic grammars?
    Generative grammars describe how strings in a language can be generated, while analytic grammars describe how strings can be parsed or recognized.
  20. What are major differences between the way generative and analytic grammars are constructed?
    (1) analytic grammars have ordered choice, (2) analytic grammars have lookahead, (3) rules in analytic grammars have nothing but a single variable on the left hand side.

Summary

We’ve covered:

  • What parsing is
  • Lexical analysis vs. syntax analysis
  • Various approaches to parsing
  • The method of recursive descent
  • Parsing based on Context-Free Grammars
  • Analytic Grammars
  • Parsing Expression Grammars