Parsing is the process of uncovering the derivation, or derivation tree, of a string from a grammar.
For example, given the grammar:
$\begin{array}{lcl} \mathit{pal} & \longrightarrow & (\texttt{"a"}\;|\;\texttt{"b"}\;|\;\texttt{"a"}\;\mathit{pal}\;\texttt{"a"}\;|\;\texttt{"b"}\;\mathit{pal}\;\texttt{"b"})? \\ \end{array}$
we parse the string $abbba$ into the derivation tree:
```
         pal
        / | \
       /  |  \
      a  pal  a
         / | \
        /  |  \
       b  pal  b
           |
           |
           b
```
In a real programming language, we generally don’t build the derivation tree all the way down to the individual characters of the source program. Instead, we first group characters into tokens and treat the tokens as the grammar’s terminal symbols. For example, the Astro statement `x = 3 * y;` would reach the parser as the six-token stream: id, `=`, num, `*`, id, `;`.
The pre-processing of the source program via tokenization is so common and so natural and so important that there is a kind of separate (simpler) theory and practice that deals with this, called lexical analysis. Let’s briefly talk about that first before getting into the richer theory of parsing.
Tokenization, or lexical analysis, is the process of forming tokens out of the source code character stream. You basically look at the character stream character-by-character, skipping whitespace and comments, and collecting characters into tokens. Sometimes you might need to “look ahead” a little, or “back up” as you go through the character stream.
Generally, each token will have a type, a lexeme, and line and column position. Types of tokens include:

- keywords (reserved words)
- identifiers
- literals (numeric, string, character, and so on)
- operators
- punctuation symbols (parentheses, commas, semicolons)
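To make that concrete, here is one plausible shape for token objects, sketched in TypeScript (the type and field names are illustrative assumptions, not from any particular compiler):

```typescript
// One possible shape for tokens (illustrative only)
type Token = {
  type: "keyword" | "id" | "num" | "symbol"; // the token's category
  lexeme: string;                            // the exact matched characters
  line: number;                              // 1-based source position...
  column: number;                            // ...for error reporting
};

// The Astro statement `x = 3 * y;` might yield:
const example: Token[] = [
  { type: "id", lexeme: "x", line: 1, column: 1 },
  { type: "symbol", lexeme: "=", line: 1, column: 3 },
  { type: "num", lexeme: "3", line: 1, column: 5 },
  { type: "symbol", lexeme: "*", line: 1, column: 7 },
  { type: "id", lexeme: "y", line: 1, column: 9 },
  { type: "symbol", lexeme: ";", line: 1, column: 10 },
];
```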
The lexical grammar is the part of the language specification that shows how tokens are formed from characters. Negative lookahead is common! An example might be:
$ \begin{array}{lcl} \mathit{print} &\longleftarrow& \texttt{"print"}\;\neg\mathit{idchar}\\ \mathit{id} &\longleftarrow& \neg\mathit{print}\;\mathit{letter}\;\mathit{idchar}^*\\ \mathit{idchar} &\longleftarrow& \mathit{letter}\,|\,\mathit{digit}\,|\,\texttt{"_"}\\ \mathit{num} &\longleftarrow& \mathit{digit}^+ \; (\texttt{"."} \; \mathit{digit}^+)? \end{array} $

Here the negative lookahead $\neg\mathit{idchar}$ keeps `print` from matching a prefix of an identifier such as `printer`, and $\neg\mathit{print}$ keeps identifiers from swallowing the reserved word.
Sometimes people find it helpful to visualize how a lexer works via a finite automaton:
TODO
Writing a lexer by hand is possible. For the simple lexical grammar shown above, a hand-crafted lexer might look like the sketch below.
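Here’s a minimal sketch in TypeScript; the `Token` shape, the pattern names, and the pattern table itself are my own illustrative choices, not from any particular compiler:

```typescript
// A hand-crafted lexer sketch. Patterns are tried in order, so order
// matters: the keyword pattern (with its negative lookahead) comes
// before id, and "**" comes before "*" inside the symbol pattern.
type Token = { type: string; lexeme: string; line: number; column: number };

const patterns: [string, RegExp][] = [
  ["space",  /^(?:[ \t\r\n]|\/\/[^\n]*)+/],     // whitespace and comments, skipped
  ["num",    /^\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?/],
  ["print",  /^print(?![A-Za-z0-9_])/],         // ~idchar, as a regex lookahead
  ["id",     /^[A-Za-z][A-Za-z0-9_]*/],
  ["symbol", /^(?:\*\*|[-+*\/%=();,])/],
];

function* tokenize(source: string): Generator<Token> {
  let pos = 0, line = 1, column = 1;
  while (pos < source.length) {
    const here = source.slice(pos);
    let lexeme: string | undefined;
    for (const [type, regex] of patterns) {
      lexeme = here.match(regex)?.[0];
      if (lexeme === undefined) continue;
      if (type !== "space") yield { type, lexeme, line, column };
      for (const ch of lexeme) {                // track source positions
        if (ch === "\n") { line += 1; column = 1; } else { column += 1; }
      }
      pos += lexeme.length;
      break;
    }
    if (lexeme === undefined) {
      throw new Error(`Lexical error at line ${line}, column ${column}`);
    }
  }
}

// [...tokenize("x = 3 * y; // doubled\nprint x;")] produces tokens
// for x, =, 3, *, y, ;, print, x, and ; with their source positions.
```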
In general, lexers are concerned with issues such as:

- the source character set and its encoding (ASCII? Unicode?)
- case sensitivity
- whether whitespace is significant, and where comments may appear
- how much lookahead (or backing up) is needed
- maximal munch: always taking the longest possible token, so that `**` is one token rather than two `*`s
- distinguishing reserved words from identifiers
Errors that might occur during scanning, called lexical errors, include:

- encountering a character that cannot begin any token (e.g., `@` in some languages)
- an unterminated comment
- an unterminated string literal
- a malformed numeric literal, such as `1.2.3`
Once you have a lexer written and deployed, you can now work entirely with token streams. Parsing now works entirely at the level of the phrase grammar. The parser’s job is to uncover the derivation tree, or parse tree, whose leaves are now tokens, not characters.
A language’s syntax is normally defined with a CFG (or the equally powerful BNF, EBNF, or syntax diagrams), or with a PEG. A compiler uses the grammar to parse the token sequence, that is, to determine whether the token string can be generated by the grammar, and to produce the syntax tree for the token sequence. People have been studying parsing for decades! Here are some things they figured out:

- An arbitrary CFG can be parsed in roughly cubic time (for example, with the Earley or CYK algorithms).
- Practical languages are designed so that their grammars can be parsed in linear time; the best-known such grammar classes are $LL(k)$ and $LR(k)$.
- The $LL$ languages form a proper subset of the $LR$ languages.
- Whether an arbitrary CFG is ambiguous is undecidable.
That’s some good background info to start a study of parsing. But wait, first, what do $LL$ and $LR$ mean? The first letter of each says the parser scans its input Left-to-right; the second names the derivation the parser traces out: an $LL$ parser produces a Leftmost derivation (working top-down), while an $LR$ parser produces a Rightmost derivation in reverse (working bottom-up).

Um, are there more practical descriptions of these language classes? Pretty much, yeah. Try these: an $LL(k)$ parser has to commit to a rule by peeking at only the next $k$ tokens, before it has consumed the phrase; an $LR(k)$ parser gets to see a whole phrase (plus $k$ tokens of lookahead) before committing.
An $LL$ and $LR$ Fact Sheet:
| LL Grammars | LR Grammars |
|---|---|
| Parsed top-down | Parsed bottom-up |
| Trace out a leftmost derivation | Trace out a rightmost derivation, in reverse |
| Cannot have left-recursive rules | Left-recursive rules are fine |
| Natural to write by hand, via recursive descent | Usually produced with a parser generator (YACC, Bison, and friends) |
| Define a proper subset of the deterministic context-free languages | Define exactly the deterministic context-free languages |
Wow! That was a lot. Details are coming. Read on.
We’ve made syntax diagrams to describe all the legal programs in the language, but how do we follow the diagrams to “match” or “parse” a candidate program? That is, given a program (a candidate string), how, exactly, do we figure out how it is matched by, or generated by, the syntax diagrams? How do we parse?
There have been hundreds of books and thousands of papers on the topic of parsing over the years. Here we just need to highlight a few things.
Because a parser is just a program that turns source code into a parse tree, you can just write one by hand. The most common kind of hand-crafted parser uses the technique of recursive descent, which you can read about at Wikipedia. You kind of walk through each of the syntax diagrams.
I’ve written a recursive descent parser as part of an Astro compiler. We’ll take a very brief look at it in class.
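Since we’ll only glance at the real thing in class, here’s a little sketch of the technique for a cut-down expression grammar with `+`, `-`, `*`, `/`, parentheses, and numbers (the function names and token shape are my own, purely for illustration):

```typescript
// Recursive descent: one function per grammar rule, each returning
// the tree it parsed.
//
//   Exp     -> Term (("+" | "-") Term)*
//   Term    -> Primary (("*" | "/") Primary)*
//   Primary -> num | "(" Exp ")"

type Tree = number | { op: string; left: Tree; right: Tree };

function parse(tokens: string[]): Tree {
  let pos = 0;

  // Consume the expected token or report a syntax error.
  function expect(token: string) {
    if (tokens[pos] !== token) throw new Error(`Expected ${token}`);
    pos++;
  }

  function parseExp(): Tree {
    let left = parseTerm();
    while (tokens[pos] === "+" || tokens[pos] === "-") {
      const op = tokens[pos++];
      left = { op, left, right: parseTerm() }; // builds left-associative trees
    }
    return left;
  }

  function parseTerm(): Tree {
    let left = parsePrimary();
    while (tokens[pos] === "*" || tokens[pos] === "/") {
      const op = tokens[pos++];
      left = { op, left, right: parsePrimary() };
    }
    return left;
  }

  function parsePrimary(): Tree {
    if (tokens[pos] === "(") {
      pos++;
      const exp = parseExp();
      expect(")");
      return exp;
    }
    const num = Number(tokens[pos]);
    if (Number.isNaN(num)) throw new Error(`Expected a number, got ${tokens[pos]}`);
    pos++;
    return num;
  }

  const tree = parseExp();
  if (pos < tokens.length) throw new Error("Unexpected trailing tokens");
  return tree;
}

// parse(["2", "+", "3", "*", "4"])
//   => { op: "+", left: 2, right: { op: "*", left: 3, right: 4 } }
```

Notice how each grammar rule becomes one function, and how the while loops build left-associative trees; we’ll see this same loop trick again below, when PEGs need to avoid left recursion.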
A recursive descent parser generally requires a highly restricted syntax, one that is in line with a class of languages known as the $LL(k)$ languages. These languages are a proper subset of the $LR(k)$ languages (also known as the deterministic context-free languages), which are used in popular parser generators such as YACC and Bison. (ANTLR, another famous generator, instead uses an extended $LL$ technique.) While these systems are quite popular and famous, we won’t be covering them here. Do read about them, though.
We will see $LL(k)$ and $LR(k)$ later in the course, though.
Another approach is to use a specialized kind of grammar designed specifically for parsing. So instead of using traditional generative grammars, covered previously in our notes on Language Theory (which specify a language by giving a set of rules to generate all and only those strings in a language), we can use an analytic grammar, in which the rules specify how to match a candidate string.
The biggest difference between generative and analytic grammars is in the handling of alternatives. In a generative grammar $G$, $L(G)$ contains all strings that can be generated by applying the rules in every possible way. For example, the rule $S \rightarrow A\,|\,B$ means you can replace an $S$ with an $A$ or a $B$, so you can generate both. So the grammar:
$s \longrightarrow \texttt{"a"}\;|\;\texttt{"ab"}$
describes the language $\{ a, ab\}$.
In an analytic grammar, the rules are deterministic; you don’t get to pick one or the other. So $A\,/\,B$ means: “Try to match $A$. If you can, the rule matches and you are done. Otherwise (and only otherwise), try to match $B\,$”. So:
$s \longleftarrow \texttt{"a"}\;/\;\texttt{"ab"}$
recognizes the language $\{ a \}$, but:
$s \longleftarrow \texttt{"ab"}\;/\;\texttt{"a"}$
recognizes the language $\{ a, ab \}$.
Note the tradeoff here. Determinism means there’s never ambiguity (a string can parse in at most one way), which is great! But the ordering of the alternatives matters, so you have to order them carefully to get the language you want. Here’s another illustration. If we wanted an analytic grammar for Astro, the rule:
$\mathit{Primary} \longleftarrow \mathit{num}\;/\;\mathit{id}\;/\;\mathit{Call}\;/\;\ldots$
would be wrong, as it will never match a call! The $\mathit{id}$ alternative succeeds on the function’s name before the $\mathit{Call}$ alternative is ever tried, leaving the argument list unmatched. (Notice how the Astro grammar later in these notes puts $\mathit{Call}$ before $\mathit{id}$.)
This might be deep, so we’ll try a few more examples.
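One way to internalize ordered choice is to play with a tiny PEG-style matcher. In this sketch, a parser is a function from an input and a position to either the new position (success) or `null` (failure); every name here is made up for illustration:

```typescript
// A tiny PEG-style matcher, just to make ordered choice concrete.
type P = (input: string, pos: number) => number | null;

// Match a literal string.
const lit = (s: string): P => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : null;

// Ordered choice: try alternatives left to right; commit to the
// first one that succeeds.
const choice = (...alts: P[]): P => (input, pos) => {
  for (const alt of alts) {
    const next = alt(input, pos);
    if (next !== null) return next;
  }
  return null;
};

// A parser recognizes a string if it matches ALL of it.
const recognizes = (p: P, input: string) => p(input, 0) === input.length;

const s1 = choice(lit("a"), lit("ab")); // s <- "a" / "ab"
const s2 = choice(lit("ab"), lit("a")); // s <- "ab" / "a"

console.log(recognizes(s1, "a"));  // true
console.log(recognizes(s1, "ab")); // false! "a" matched first, leaving "b"
console.log(recognizes(s2, "ab")); // true
console.log(recognizes(s2, "a"));  // true
```

Running this shows exactly the $\{a\}$ versus $\{a, ab\}$ behavior described above.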
Unlike generative grammars, analytic grammars feature special lookahead operators, including the negative lookahead operator (`~`) and the positive lookahead operator (`&`). These enable the definition of non-context-free languages, despite having only one symbol on the left-hand side! Here’s an analytic grammar for the non-context-free language $\{ a^nb^nc^n \,\mid\, n \geq 1 \}$:

$ \begin{array}{lcl} \mathit{s} &\longleftarrow& \texttt{\&}\,(\mathit{x}\;\texttt{"c"}) \; \texttt{"a"}^+\;\mathit{y} \\ \mathit{x} &\longleftarrow& \texttt{"a"} \; \mathit{x}? \; \texttt{"b"} \\ \mathit{y} &\longleftarrow& \texttt{"b"} \; \mathit{y}? \; \texttt{"c"} \end{array} $

The lookahead $\texttt{\&}\,(\mathit{x}\;\texttt{"c"})$ checks, without consuming anything, that the input begins with $a^nb^n$ followed by a $c$; then $\texttt{"a"}^+\;\mathit{y}$ consumes the $a$s and requires the rest to be $b^mc^m$. Only strings of the form $a^nb^nc^n$ survive both tests.
There are various formalisms of analytic grammars (Wikipedia identifies several); the most popular is the Parsing Expression Grammar (PEG). To read a PEG description, pronounce the operators as follows:

- `A B`: match `A`, then match `B`
- `A / B`: match `A`; if it matched, success; otherwise, match `B`
- `A?`: match `A` zero or one times
- `A*`: match `A` zero or more times
- `A+`: match `A` one or more times
- `&A`: match `A` without consuming it (but it must match)
- `~A`: match `A` without consuming it; if `A` matches, then `~A` fails; if `A` fails to match, then `~A` succeeds
- `end`: matches the end of the string

Note how PEGs use `/` instead of `|`, perhaps to emphasize the deterministic nature of matching.
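The lookahead operators fit the same mold as the little matcher above: they run a parser but restore the position on success, consuming nothing. Here’s a sketch in the same style, applied to the keyword rule from the lexical grammar earlier ($\mathit{print} \longleftarrow \texttt{"print"}\;\neg\mathit{idchar}$); again, all names are illustrative:

```typescript
type P = (input: string, pos: number) => number | null;

const lit = (s: string): P => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : null;

// Sequencing: match each parser in turn, threading the position.
const seq = (...ps: P[]): P => (input, pos) => {
  for (const p of ps) {
    const next = p(input, pos);
    if (next === null) return null;
    pos = next;
  }
  return pos;
};

// &A: succeed (consuming nothing) exactly when A matches here.
const and = (p: P): P => (input, pos) =>
  p(input, pos) !== null ? pos : null;

// ~A: succeed (consuming nothing) exactly when A fails here.
const not = (p: P): P => (input, pos) =>
  p(input, pos) === null ? pos : null;

const idchar: P = (input, pos) =>
  /[A-Za-z0-9_]/.test(input[pos] ?? "") ? pos + 1 : null;

// print <- "print" ~idchar
const printKeyword = seq(lit("print"), not(idchar));

console.log(printKeyword("print x", 0)); // 5: matched the keyword
console.log(printKeyword("printer", 0)); // null: "printer" is an id, not a keyword
```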
Strictly speaking, it should be impossible in a PEG to have left-recursive rules like:
$\mathit{Exp} \; \longleftarrow \mathit{Exp} \; \texttt{"+"} \; \mathit{Term} \;\;/\;\; \mathit{Term}$
given that “in order to match an Exp, you must first try to match an Exp.” It sounds circular. In fact it is, and traditionally, PEG implementations suggested using the rule:
$\mathit{Exp} \; \longleftarrow \mathit{Term} \; (\texttt{"+"} \; \mathit{Term})^*$
instead. However, it turns out a proper parsing technique (using a packrat parser) can deal with left recursion directly! It’s not easy, but you can read the original paper by Alex Warth et al. or this nice and detailed article by Guido van Rossum to see how it’s done.
**Check your understanding.** Analytic grammars:
- Are designed for parsing (recognition) rather than generation of languages.
- Are written with only a single variable on the left hand side of each rule.
- Use $\longleftarrow$ rather than $\longrightarrow$ to define rules.
- Execute alternatives left-to-right, and once an alternative matches, you keep it, even if it leads to a dead end later.
- Have powerful lookahead operators, thanks to deterministic execution.
- Can handle left-recursive rules.
Try this exercise before moving on: Show that the grammar

$\mathit{S} \;\longrightarrow\; \texttt{"a"} \;|\; \texttt{"i"}\;\texttt{"x"}\;\texttt{"t"}\;\mathit{S}\;(\texttt{"e"}\;\mathit{S})?$

is ambiguous, but that the ambiguity can be resolved by rewriting it as a PEG:

$\mathit{S} \;\longleftarrow\; \texttt{"a"} \;/\; \texttt{"i"}\;\texttt{"x"}\;\texttt{"t"}\;\mathit{S}\;\texttt{"e"}\;\mathit{S} \;/\; \texttt{"i"}\;\texttt{"x"}\;\texttt{"t"}\;\mathit{S}$

How, exactly, is the ambiguity resolved? (Historical note: this is the dangling else problem, a classic in parsing theory.)
Now let’s put everything together to come up with a Parsing Expression Grammar for Astro, one that (1) properly separates lexical and phrase rules, (2) deals properly with precedence and associativity, (3) embraces left recursion, and (4) correctly handles reserved words.
$ \begin{array}{lcl} \mathit{Program} &\longleftarrow& \mathit{Statement}^+ \\ \mathit{Statement} &\longleftarrow& \mathit{Assignment} \;/\; \mathit{PrintStmt} \\ \mathit{Assignment} &\longleftarrow& \mathit{id}\;\texttt{"="}\;\mathit{Exp}\;\texttt{";"} \\ \mathit{PrintStmt} &\longleftarrow& \mathit{print}\;\mathit{Exp}\;\texttt{";"} \\ \mathit{Exp} &\longleftarrow& \mathit{Exp} \;(\texttt{"+"}\,|\,\texttt{"-"})\; \mathit{Term} \;/\; \mathit{Term} \\ \mathit{Term} &\longleftarrow& \mathit{Term} \;(\texttt{"*"}\,|\,\texttt{"/"}\,|\,\texttt{"%"})\; \mathit{Factor} \;/\; \mathit{Factor} \\ \mathit{Factor} &\longleftarrow& \mathit{Primary}\;(\texttt{"**"}\;\mathit{Factor})? \;/\; \texttt{"-"} \; \mathit{Primary} \\ \mathit{Primary} &\longleftarrow& \mathit{num} \;/\; \mathit{Call} \;/\; \mathit{id} \;/\; \texttt{"("} \; \mathit{Exp} \; \texttt{")"} \\ \mathit{Call} &\longleftarrow& \mathit{id}\;\texttt{"("}\;(\mathit{Exp}\;(\texttt{","}\;\mathit{Exp})^*)?\;\texttt{")"} \\ \mathit{num} & \longleftarrow & \mathit{digit}^+ \; (\texttt{"."} \; \mathit{digit}^+)? \; ((\texttt{"E"} | \texttt{"e"}) \; (\texttt{"+"} | \texttt{"-"})? \;\mathit{digit}^+)? \\ \mathit{print} &\longleftarrow& \texttt{"print"}\;\neg\mathit{idchar}\\ \mathit{id} &\longleftarrow& \neg\mathit{print}\;\mathit{letter}\;\mathit{idchar}^*\\ \mathit{idchar} &\longleftarrow& \mathit{letter}\,|\,\mathit{digit}\,|\,\texttt{"_"}\\ \mathit{space}&\longleftarrow& \texttt{" "}\,|\,\texttt{"\t"}\,|\,\texttt{"\r"}\,|\,\texttt{"\n"}\,|\,\texttt{"//"}\;(\neg\texttt{"\n"} \; \mathit{any})^* \end{array} $
**Lexical and Phrase Variables.** Ohm also adopts the convention of starting the names of lexical variables with a lowercase letter and phrase variables with an uppercase letter.
PEGs are designed for direct parsing: there’s no need to “extract” any kind of machine. The matching process is built into the grammar rules.
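As a concrete illustration, here is a rough Ohm transcription of part of the Astro grammar above. The abbreviation is mine ($\mathit{Call}$ and the exponent part of $\mathit{num}$ are elided), and `-- binary` and `-- paren` are Ohm case labels:

```typescript
import * as ohm from "ohm-js";

// Ohm's packrat parser supports the left-recursive Exp and Term
// rules directly, just as written in the PEG.
const astro = ohm.grammar(String.raw`
  Astro {
    Program    = Statement+
    Statement  = Assignment | PrintStmt
    Assignment = id "=" Exp ";"
    PrintStmt  = print Exp ";"
    Exp        = Exp ("+" | "-") Term  -- binary
               | Term
    Term       = Term ("*" | "/" | "%") Primary  -- binary
               | Primary
    Primary    = "(" Exp ")"  -- paren
               | num
               | id
    num        = digit+ ("." digit+)?
    print      = "print" ~idchar
    id         = ~print letter idchar*
    idchar     = letter | digit | "_"
  }
`);

console.log(astro.match("x = 3 * (y + 5); print x;").succeeded()); // true
console.log(astro.match("42 = x * 3;").succeeded());               // false
```

(We’ll see `42 = x * 3` again below, in the list of syntax errors.)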
Must reads:
The language designer will have to take into account any ambiguities in the language description. For example, in C, the expression `(x)-y` has two different syntactic interpretations: if `x` refers to a type, then it’s a cast of a negation; otherwise it is a subtraction. You can’t tell at this phase.
Errors that occur during parsing, called syntax errors, arise from malformed token sequences such as:

```
j = 4 * (6 - x;
i = /5
42 = x * 3
```
Not every error in a source program is a syntax error, though. Problems such as using an undeclared identifier, calling a function with the wrong number of arguments, or applying an operator to the wrong kind of operand can pass the parser just fine; they are caught later, during semantic analysis.
We’ve covered:

- Parsing as uncovering the derivation tree of a string
- Tokenization (lexical analysis), tokens, and lexical errors
- The $LL$ and $LR$ grammar and language classes
- Hand-written recursive descent parsers
- Analytic grammars and PEGs: ordered choice, lookahead, and left recursion
- A complete PEG for Astro
- Syntax errors