# Syntax

What is syntax, and how does it differ from, say, semantics? What are the different ways that we can define a syntax? What is the difference between concrete and abstract syntax? And why do programming languages tend to use context-free grammars for their syntax?

## Motivation

A programming language gives us a way to structure our thoughts. Each program, then, has a kind of internal structure, for example:

We generally capture this structure as a sequence of symbols; for example, the structure above might appear as:

dozen = 1 * (0 + sqrt(101.3));
y = dozen - 0;    // TADA 🥑
dozen = 0 / y;
print(cos(dozen)) ;//


or even:

MOVE PRODUCT OF 1 AND (SUM OF 0 AND SQRT OF 101.3) INTO DOZEN.
MOVE DIFFERENCE OF DOZEN AND 0 INTO Y.
COMMENT TADA 🥑
MOVE QUOTIENT OF 0 AND Y INTO DOZEN.
PRINT COSINE OF DOZEN.
COMMENT


Of course, we don’t have to use strings! Visual languages do exist:

If you were a language designer, you’d deal with questions such as:

• Is code expressed in terms of blocks? Visual graphs? Text?
• If text, which characters are allowed? Is whitespace significant? Is case significant?
• Is there a lot of punctuation? Symbols? Or is it mostly words?
• How do we do comments? Pragmas? Directives?
• Do we express structure with indentation, curly-braces, terminating-end, or nested parentheses?
• Should operations have a functional (length(a)) or message-oriented (a.length) feel?
• What kind of shorthands and idioms should be made available?
• How do program units map to files or other physical containers?

## Definition

The syntax of a programming language is the set of rules that define which arrangements of symbols comprise structurally legal programs—and for each structurally legal program, what its structure actually is.

Syntactic rules should be able to tell us that these sentences are not well-formed in English:

• telescope a they the With hit saw.
• yqvfe.,--)AO7%;;;;;;;; ; !dogs

but that these are:

• They hit the telescope with a saw.
• They saw the hit with a telescope.

and so is this famous one, due to Chomsky:

• Colorless green ideas sleep furiously.

## Designing the Syntax of a Language

Let’s learn about syntax by going through the process of designing a syntax for a small language. A designer will often start by writing down examples of programs in the language. Let’s design a small language which we’ll call Astro. Programs will look like this:

// A simple program in Astro

radius = 55.2 * (-cos(2.8E-20) + 89) % 21;    // assignment statement
the_area = π * radius ** 2;                   // a built-in identifier
print(the_area, hypot(2.28, 3 - radius) / 5); // call statement


So programs are structured as a sequence of one or more statements, each of which is an assignment or a call, with expressions formed from numbers, variables, calls, and the usual arithmetic operators, which can have parentheses when needed. Comments look like the slash-slash comments of C-like languages. Let’s get a little more precise by turning these thoughts into pictures.

A program is a sequence of one or more statements:

Program

There will be two kinds of statements: assignments and calls, each ending with a semicolon:

Statement

Assignments and calls look like this:

Assignment
Call

An identifier is the computer science word for the name of something you define in your code, like a variable, constant, function, type, parameter, or similar thing. A keyword is a special word in your language, like let or if or while; these words generally do not name things but are used to structure the code. If a keyword is not allowed to be an identifier, we call it a reserved word. Astro does not have any keywords, but most languages do.

Identifiers in Astro begin with a letter, and can have letters, digits, and underscores (examples: x, last_attempt, p1, p2, overTheLimit). We’ll assume letter and digit are built-in, but we must define identifiers:

id

A statement performs an action (with a side effect), while an expression represents the computation of a value. Expressions can be numbers, variables, function calls, or can be made up of operators applied to other expressions. We can put parentheses around any expression. This will do the trick (but spoiler alert: we will have to change it later):

Exp
Exercise: Explain the difference between expressions and statements to a friend. Make yourself a sacred vow never to mix the terms up. Precise language is a good thing. Sometimes we slip up, but still, let’s try not to say things like “plus statement.”

We mentioned numerals here, so let’s define these. Super simple for this language:

num

#### Wait, why do we have rectangles vs. ovals?

Did you notice that there are two kinds of nodes in our syntax diagrams? Ovals are for nodes that internally cannot have spaces; rectangles are used for constructs whose components can have spaces between them. By spaces, we mean whitespace characters (space characters, tab characters, newlines and such), or comments. The special category called space defines how our syntactic components can be separated.

space

We’ve chosen the space character, the tab character, the carriage return character, the linefeed character (the linefeed (U+000A) is pretty much universally accepted as the “newline”), and comments, which start with // and go to the end of the current line.

The diagrams are a great way to carry out the syntactic design, but they are a little vague in places (notice the “any non-newline char” piece), and it’s really, really hard to process diagrams by machine. They don’t scale up to massive languages, either: a diagrammatic presentation of a language can get “spread out” and hard to see all at once. Text is much denser than pictures, and more machine-friendly. So now it’s time to write a precise, formal grammar for our language.

## Grammars

Let’s convert our diagrams into actual grammars. This is actually quite easy to do (once you learn all the notation):

-- Grammar for the programming language Astro

Program    → Statement+
Statement  → Assignment ";"
| Call ";"
Assignment → id "=" Exp
Call       → id "(" (Exp ("," Exp)*)? ")"
Exp        → num
| id
| Call
| "-" Exp
| Exp ("+" | "-" | "*" | "/" | "%" | "**") Exp
| "(" Exp ")"
num        → digit+ ("." digit+)? (("E" | "e") ("+" | "-")? digit+)?
id         → letter (letter | digit | "_")*
space      → " " | "\t" | "\r" | "\n" | "//" ("\x00".."\x09" | "\x0b".."\x10ffff")*


This notation is pretty compact! Here’s a brief explainer:

• A grammar is a sequence of rules that derive, starting from the first, or start, rule, all and only those strings of characters comprising legally structured programs.
• Rule names beginning with lowercase letters are called lexical rules. Rule names beginning with uppercase letters are called phrase rules, and implicitly have a space* between all of the constructs on their right hand side.
• Certain symbols are predefined: letter (matches any unicode letter), digit (matches any Basic Latin digit "0".."9"), alnum (matches any letter or digit), end (matches when you are at the end of the input), and any (matches any character). The start rule is assumed to have an end at the end, even if not explicitly shown.
• Quotes mean you should handle their characters literally, except for the usual escapes for newline (\n), tab (\t), backslash (\\) and friends, as has become conventional in computer science. Also as per convention: * means “zero or more”, + means “one or more”, ? means zero-or-one (i.e., “optional”), and | means “or”.
• The .. defines a range of characters. Characters whose code points are known can be given with a \x prefixed to their (hex) code point; for example \x1f4a9 is 💩. The backslash can also be used to make some common characters more readable, e.g. \t stands for \x09 (tab character); \n stands for \x0a (the linefeed character, a.k.a. newline); \r stands for \x0d (the carriage return character). Not shown above, but useful later, will be \\ for \x5c (the backslash character itself); and \" for \x22 for the quote character.

#### Lexical vs. Phrase Rules

The lexical rules, with the exception of the special rule for space, show us how to combine characters into the words and punctuation that we will call tokens. The phrase rules show us how to combine words and symbols into phrases (where the words and symbols can, if you wish, be surrounded by spacing).

Exercise: Meditate on how profound this is. Have you seen anything like this before? Hint: Are you reading this page in English?

This means it is best to think about the syntax as actually having two distinct parts. The first part (the lexical rules) shows us how to tokenize a stream of characters into a stream of tokens, and the second part (the phrase rules) shows us how to parse the token stream into a parse tree. Here’s an example. The source string:

  print( // ⿇🌿
420 );


is made up of these characters:

SPACE SPACE LATIN SMALL LETTER P LATIN SMALL LETTER R LATIN SMALL LETTER I LATIN SMALL LETTER N LATIN SMALL LETTER T LEFT PARENTHESIS TAB SOLIDUS SOLIDUS SPACE KANGXI RADICAL HEMP HERB LINEFEED DIGIT FOUR DIGIT TWO DIGIT ZERO SPACE RIGHT PARENTHESIS SEMICOLON

When we apply the lexical rules to this character stream, skipping the spaces (and comments), we get the token stream:

id(print) ( num(420) ) ;
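The tokenization step just described can be sketched in a few lines of Python (the code and the names in it are our own, not part of the notes): one regular expression tries each lexical rule in turn, and anything matched by the space rule is simply skipped.

```python
import re

# Sketch of a tokenizer for Astro. Each named group plays the role of one
# lexical rule; space (whitespace and //-comments) is matched but dropped.
TOKEN_SPEC = re.compile(r"""
      (?P<space>   [ \t\r\n]+ | //[^\n]* )
    | (?P<num>     \d+ (?:\.\d+)? (?:[Ee][+-]?\d+)? )
    | (?P<id>      [^\W\d_] \w* )                     # a letter, then letters/digits/underscores
    | (?P<sym>     \*\* | [-+*/%=(),;] )
""", re.VERBOSE)

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_SPEC.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "space":        # spaces and comments never become tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Running `tokenize("  print(\t// ⿇🌿\n420 );")` produces exactly the token stream shown above: an id, three symbols, and a num, with all the spacing and the comment gone.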

Next, applying the phrase rules allows us to uncover the parse tree:

A very important thing to note: The frontier of the parse tree is the token stream.

Exercise: Repeat this phrase to yourself five times: The frontier of the parse tree is the token stream.
#### The parse tree ends at tokens, not characters

Please take the time to consider how silly it would be if the parse tree expanded all of the lexical rules down to characters. Having the parse tree stop at tokens is a good example of what we would call “breaking a complex problem down into simpler parts.”

#### Dealing With Ambiguity

How did we do with our grammar? Well, you know, there is one issue: It can parse the string 9-3*7 in two ways:

Having more than one parse tree for a given input string means that our syntax description is ambiguous. This isn’t necessarily bad, because we could (if we wanted to) leave the ambiguity in the grammar but supply extra informal explanatory text in our language definition that determines how the ambiguity is to be resolved. 🙄🙄🙄 Okay, if you don’t like that idea, we can get rid of this particular ambiguity by adding extra rules in the grammar itself, that basically force an operator precedence:

• An expression is a sequence of one or more terms, separated by additive operators
• A term is a sequence of one or more factors, separated by multiplicative operators
• A factor is made up of primaries, separated by exponentiation operators
• A primary is the most basic thing (variables, numerals, calls), including parenthesized expressions, which allow unlimited nesting
Exp        → Term (("+" | "-") Term)*
Term       → Factor (("*" | "/" | "%") Factor)*
Factor     → Primary ("**" Primary)*
Primary    → num
| id
| Call
| "-" Primary
| "(" Exp ")"


Great! Now there is one and only one parse tree for that previously problematic expression:

Exercise: Argue why, or prove if you can, that the new grammar is not ambiguous.
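One way to convince yourself that the layering works is to turn it into code. Here is a rough recursive-descent evaluator (our own sketch, not part of the notes): each phrase rule becomes one function, and because Exp calls Term and Term calls Factor, multiplication binds tighter than addition with no extra machinery.

```python
import re

# Sketch: each layered rule from the grammar becomes one function.
def evaluate(source):
    tokens = re.findall(r"\d+(?:\.\d+)?|\*\*|[-+*/%()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = tokens[pos]
        assert expected is None or token == expected
        pos += 1
        return token

    def exp():        # Exp → Term (("+" | "-") Term)*
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if take() == "+" else value - term()
        return value

    def term():       # Term → Factor (("*" | "/" | "%") Factor)*
        value = factor()
        while peek() in ("*", "/", "%"):
            op, rhs = take(), None
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs if op == "/" else value % rhs
        return value

    def factor():     # Factor → Primary ("**" Primary)*
        value = primary()
        while peek() == "**":
            take()
            value = value ** primary()
        return value

    def primary():    # Primary → num | "-" Primary | "(" Exp ")"
        if peek() == "(":
            take("(")
            value = exp()
            take(")")
            return value
        if peek() == "-":
            take("-")
            return -primary()
        return float(take())

    return exp()
```

With this structure, `evaluate("9-3*7")` gives -12, the 9-(3*7) reading, while explicit parentheses override the layering as expected.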

If you are still getting used to the grammar notation, let’s see the new rules as syntax diagrams:

Exp
Term
Factor
Primary

Our grammar rules have forced the binary operators into a precedence hierarchy!

• + and - have the lowest precedence.
• *, /, and % have the next higher precedence.
• ** has the highest precedence.

#### Associativity

Wait, we are not done with structuring our operators just yet. The way things stand now, the parse tree for 3-8-5 looks pretty flat:

It doesn’t suggest whether we mean to compute (3-8)-5 (= -10) or 3-(8-5) (= 0). As language designers, we should make the choice of left-associativity or right-associativity or even non-associativity for each operator. There is a way to do this. Let’s make the additive and multiplicative operators left associative and the exponentiation operator right associative:

Exp        → (Exp ("+" | "-"))? Term
Term       → (Term ("*" | "/" | "%"))? Factor
Factor     → Primary ("**" Factor)?


Or, equivalently, in diagrams:

Exp
Term
Factor

How the heck does this work? Study these parse trees, and hopefully the insight will come to you! (Hint: remember the grammar is designed to force the tree to “come out” a certain way.)

Exercise: Study this until you understand it well.
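One way to see the insight (a sketch of ours, not from the notes): a left-recursive rule folds its operands from the left, while a right-recursive rule recurses to the right. Building the trees as nested tuples makes the difference visible.

```python
# Sketch: how recursion direction in a grammar rule shapes the tree.
def fold_left(operands, op):
    # Left-recursive, like Exp → (Exp ("+" | "-"))? Term:
    # 3 - 8 - 5 comes out as ((3 - 8) - 5)
    tree = operands[0]
    for operand in operands[1:]:
        tree = (op, tree, operand)
    return tree

def fold_right(operands, op):
    # Right-recursive, like Factor → Primary ("**" Factor)?:
    # 2 ** 3 ** 2 comes out as (2 ** (3 ** 2))
    if len(operands) == 1:
        return operands[0]
    return (op, operands[0], fold_right(operands[1:], op))
```

So `fold_left([3, 8, 5], "-")` yields `("-", ("-", 3, 8), 5)`, the (3-8)-5 = -10 reading, and `fold_right([2, 3, 2], "**")` yields `("**", 2, ("**", 3, 2))`.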

For reference, here’s the final grammar:

Program    → Statement+
Statement  → Assignment ";"
| Call ";"
Assignment → id "=" Exp
Call       → id "(" (Exp ("," Exp)*)? ")"
Exp        → (Exp ("+" | "-"))? Term
Term       → (Term ("*" | "/" | "%"))? Factor
Factor     → Primary ("**" Factor)?
Primary    → num
| id
| Call
| "-" Primary
| "(" Exp ")"
num        → digit+ ("." digit+)? (("E" | "e") ("+" | "-")? digit+)?
id         → letter (letter | digit | "_")*
space      → " " | "\t" | "\r" | "\n" | "//" ("\x00".."\x09" | "\x0b".."\x10ffff")*


## Abstract Syntax

We have to do a lot of specification work to define programs as strings of characters. For example, we needed extra tokens (parentheses) and extra categories (Term, Factor, and Primary) in order to capture precedence. We have semicolons to terminate statements. Real languages have lots and lots of punctuation as well. The thing is, these parse trees can get huge!

But we can simplify things a bit. If we just need to know what a program is supposed to do, we don’t really need that much from our parse tree. We can abstract away a lot of the messy stuff, and translate our parse tree into an abstract syntax tree:

Crazy, right? Abstract syntax trees are much simpler. They’re more abstract. And by the way, since they exist, you will sometimes hear parse trees called concrete syntax trees.

So how do we come up with an abstract syntax? Focus on what a program is trying to express, without worrying about punctuation. You don’t even have to worry about precedence and associativity (or even parentheses) because the abstract syntax is tree-structured, and not a string. So cool! Here’s how we think about it for Astro:
• A program is a sequence of statements
• A statement is either an assignment or a call
• An assignment is an identifier being assigned an expression
• A call has an identifier for the function name and a list of expressions for its arguments
• An expression can be a number, identifier, a unary operator applied to an operand, two expressions with a binary operator, or a call.

If you want to formalize this, you can write a tree grammar:

$\begin{array}{lcl} n: \mathsf{Nml} & & \\ i: \mathsf{Ide} & & \\ e: \mathsf{Exp} & = & n \mid i \mid -e \mid e+e \mid e-e \mid e\,\mathtt{*}\,e \mid e\,/\,e \mid e\,\mathtt{\%}\,e \mid e\,\mathtt{**}\,e \mid i\;e*\\ s: \mathsf{Stm} & = & i = e \mid i\;e*\\ p: \mathsf{Pro} & = & s+\\ \end{array}$
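For concreteness, this tree grammar could be rendered as plain Python dataclasses, one class per alternative (the class and field names here are our own invention, not part of the notes):

```python
from dataclasses import dataclass

# Sketch of an AST node hierarchy for Astro.
@dataclass
class Num:
    value: float

@dataclass
class Id:
    name: str

@dataclass
class Unary:
    op: str
    operand: object

@dataclass
class Binary:
    op: str
    left: object
    right: object

@dataclass
class Call:
    callee: str
    args: list

@dataclass
class Assign:
    target: str
    source: object

@dataclass
class Program:
    statements: list
```

The statement `y = dozen - 0;` becomes `Assign("y", Binary("-", Id("dozen"), Num(0)))`: no semicolon, no parentheses, and no precedence left to worry about.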

When translating to an abstract syntax, we not only get rid of intermediate syntax categories, but we also drop parentheses, commas, semicolons, brackets, and pretty much all punctuation. Precedence and associativity disappear because they are handled naturally within the AST structure. Some of the smaller keywords that you see in larger programming languages (in, to, of) also are not needed. Punctuation really only exists to give a “tree structure” to our linear program text. Once we uncover the tree structure, it is no longer needed!

Exercise: Give the parse tree and the abstract syntax tree for print(3-5-8-2*8-9-7**3**1); according to this grammar and abstract syntax.
#### What’s the point of the Abstract Syntax?

Sure, it’s massively simpler, but what do we use it for? We use it in language design to specify the semantics (meaning) of programs in the language. We use it in language implementation to know exactly how to write the compilers and interpreters and other language processors. It’s a lot easier to run interpreters, optimizers, static analyzers, code generators, and other similar tools off the ASTs than off the CSTs. A lot easier.

### Multiple Concrete Syntax Trees

Guess what else? Another nice thing about ASTs is that they allow us to focus on programming language capabilities without worrying about syntactic details that people can reasonably have different preferences about. Take a look at this AST:

CLASSWORK
Stop! Don’t scroll by! Study this tree! What is this program “saying”?

Sure, it appears to be a program in a language whose tree grammar probably looks like this:

$\begin{array}{lcl} n: \mathsf{Nml} & & \\ i: \mathsf{Ide} & & \\ t: \mathsf{Type} & = & \mathtt{int} \mid \mathtt{bool} \\ e: \mathsf{Exp} & = & n \mid i \mid -e \mid e+e \mid e-e \mid e\,\mathtt{*}\,e \mid e\,/\,e \mid e\,\mathtt{**}\,e \\ & & \mid\; e < e \mid e \leq e \mid e\,\mathtt{==}\,e \mid e \neq e \mid e > e \mid e \geq e \mid i\;e*\\ s: \mathsf{Stmt} & = & i = e \mid \mathtt{read}\;i* \mid \mathtt{write}\;e* \mid \mathtt{while}\;e\;b \\ d: \mathsf{Dec} & = & \mathtt{declare}\;(i\;t)*\\ b: \mathsf{Block} & = & {d*}\;s+\\ p: \mathsf{Program} & = & b\\ \end{array}$

But what does the actual concrete syntax look like? There’s no specific answer. There are an infinite number of possibilities! Here is just a small sampling of programs that could have generated that AST:

program:
var x, y: int
// In this language, indentation defines blocks
while y - 5 == 3:
var y: int
get(x)
get(y)
x = 2 * (3 + y)
put(5)


int x, y;
while y - 5 = 3 {
int y;
STDIN -> x;
STDIN -> y;
x <- 2 * (3 + y);
}
STDOUT <- 5;


COMMENT THIS LOOKS LIKE OLD CODE
DECLARE INT X.
DECLARE INT Y.
WHILE DIFFERENCE OF Y AND 5 IS 3 LOOP:
DECLARE INT Y.
READ FROM STDIN INTO X.
READ FROM STDIN INTO Y.
MOVE PRODUCT OF 2 AND (SUM OF 3 AND Y) INTO X.
END LOOP.
WRITE 5 TO STDOUT.


(program
(declare x int)
(declare y int)
(while (= (- y 5) 3)
(seq (define (y int)) (read x y) (assign x (* 2 (+ 3 y))))
)
(write 5)
)


<program>
<declare var="x" type="int"/>
<declare var="y" type="int"/>
<while>
<equals>
<minus><varexp ref="y"/><intlit value="5"/></minus>
<intlit value="3"/>
</equals>
<declare var="y" type="int"/>
<read vars="x y"/>
<assign varexp="x">
<times>
<intlit value="2"/>
<plus>
<intlit value="3"/>
<varexp ref="y"/>
</plus>
</times>
</assign>
</while>
<write>
<intlit value="5"/>
</write>
</program>

Exercise: Write the example in a JSON-like notation.
Exercise: The fourth example above has an “S-expression syntax” which is seen in languages like Lisp and Clojure. In such languages, concrete and abstract syntaxes can be said to be very close to each other. Why is this claim true?

### More AST Examples

Getting a good feel for abstract syntax naturally takes practice. Here are a couple examples to study to get a feel for how abstract we need to get. Here’s a fragment of Java:

if (((f(--x)))) {
System.out.println(- 3*q);
} else if (!here||there    &&everywhere) {
while (false) {
x += x >> 2 | 3  -++  x;
}
album[5].track("name") = title.scalars[2];
}


And here’s a little JavaScript program:

export class C {
q() { const [p, q] = a.c[1]}
r({x}, ...z) { while (c.c&&p|- -x) new C() }
}


## The Problem of Context

Did you notice that the grammars we saw above left some things out? For example, none of our grammars could capture a simple rule like:

“You can only use a variable if it has been previously declared.”

Real programming languages have this issue, too. For example the official grammar for Java will not derive this program:

class A {2 / == {{;


so a compiler would report a syntax error. But this program:

class A {int x = y;}


is structurally well formed according to the official syntax of the Java language! The compilation unit consists of a type declaration that is a class declaration whose body consists of a field declaration with a type, a name, and an initializing expression which is an identifier. But we know this program is not legal, since the identifier y has not been declared. What’s going on? The same goes for this C program, which has multiple problems despite being “structurally well-formed” according to the C grammar:

/*
* This program is derivable by the official grammar for C.
*/
int main() {
f(x)->p = sqrt(3.3);
return "dog";
}


So why do these language grammars allow this nonsense? The short answer is, it’s really, really, really hard and really, really, really ugly to define contextual rules grammatically! Capturing the “all identifiers must be declared before use” constraint would require that a grammar rule for expressions not just generate any identifier, but only certain identifiers known from some outside context. Context! In general, capturing context in a grammar leads to grammars that:

• are incredibly difficult for humans to figure out,
• are nearly impossible for humans to read (they’re ridiculously long and incredibly weird),
• look nothing at all like the way we would check these rules in a compiler,
• give rise to weird trees, since the expansion of a node is based on its sibling nodes, not just its parent, which makes semantics hard to define, and
• are horribly expensive even for a computer to parse.

Try the following exercise if you’re not convinced:

Exercise: Write grammars for:
• $\{ a^nb^nc^n \;\mid\; n \geq 0 \}$
• $\{ a^{2^n} \;\mid\; n \geq 0 \}$
• $\{ ww \;\mid\; w \in \{a,b\}^* \}$
Hint 1: Grammars for these languages will have to have at least some rules with multiple symbols on the left-hand side. That is because there exist proofs (not shown here) that these three languages are not context-free.

Hint 2: Having trouble? Answers can be found on my notes on Language Theory.
Exercise: Do some research to see if there are any known lower complexity bounds on the parsing of context-sensitive grammars.

### Defining Contextual Constraints

If you think about it, so many constraints require context. Here’s a small sampling from various languages:

• Identifiers must not be redeclared within a scope
• Identifiers may not be used in a place where they might not have been initialized
• Types of all subexpressions must be consistent in an expression
• The correct number of arguments must appear in a call
• Access modifiers (private, protected, public) must be honored
• All execution paths through a function must end in a return
• All possible arms of a match or switch or discriminated union must be covered
• All abstract methods must be implemented or declared abstract
• All declared local variables must be used
• All private methods in a class must be used
• ...and so many more

Since context-sensitive grammars are too hard to write, too hard to understand, and too hard to parse, what do we do? There are at least two options:

• Use a completely new kind of grammatical formalism. Some of these do exist, including Attribute Grammars and Two-Level Grammars.
• Push contextual constraints into the world of semantics. Let the syntax be happy with undeclared variables appearing in expressions (as long as the expressions are otherwise well formed). With this approach, we say that using undeclared variables is “meaningless,” you know, a semantic problem. And because the meaning (ahem, semantics) is computed before the program is run, we would call constraint violations static semantic errors.
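The second option is what most compilers actually do: after parsing, walk the tree with a context (here, just a set of defined names) and report static semantic errors. A minimal sketch, assuming statements and expressions are encoded as plain tuples (an ad hoc encoding of our own, not from the notes):

```python
# Sketch of a "defined before use" checker for an Astro-like language.
# Statements are ("assign", target, exp) or ("call", callee, arg...);
# expressions are ("num", v), ("id", name), ("unary", op, e),
# ("binary", op, e, e), or ("call", callee, arg...).
def check_defined(statements, known=("π", "sqrt", "cos", "hypot", "print")):
    defined = set(known)          # the "context" threaded through the walk
    errors = []

    def check_exp(e):
        kind = e[0]
        if kind == "id" and e[1] not in defined:
            errors.append(f"{e[1]} used before assignment")
        elif kind in ("call", "binary", "unary"):
            for operand in e[2:]: # operands follow the operator/callee
                check_exp(operand)

    for s in statements:
        if s[0] == "assign":
            check_exp(s[2])       # the target only becomes defined afterward
            defined.add(s[1])
        else:
            check_exp(s)
    return errors
```

Notice how naturally the constraint is expressed as a tree walk with a mutable context, compared to the contortions a context-sensitive grammar would require.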

Which take do you prefer? Slonneger and Kurtz take the former approach, writing:

The classification of such an error entails some controversy. Language implementers, such as compiler writers, say that such an infraction belongs to the static semantics of a language since it involves the meaning of symbols and can be determined statically, which means solely derived from the text of the program. We argue that static errors belong to the syntax, not the semantics, of a language. Consider a program in which we declare a constant:
    const c = 5;
In the context of this declaration, the following assignment commands are erroneous for essentially the same reason: It makes no sense to assign an expression value to a constant.
    5 := 66;
    c := 66;
The error in the first command can be determined based on the context-free grammar (BNF) of the language, but the second is normally recognized as part of checking the context constraints. Our view is that both errors involve the incorrect formation of symbols in a command—that is, the syntax of the language. The basis of these syntactic restrictions is to avoid commands that are meaningless given the usual model of the language.

Though it may be difficult to draw the line accurately between syntax and semantics, we hold that issues normally dealt with from the static text should be called syntax, and those that involve a program’s behavior during execution be called semantics. Therefore we consider syntax to have two components: the context-free syntax defined by a BNF specification and the context-sensitive syntax consisting of context conditions or constraints that legal programs must obey.

### What about PEGs?

PEGs have lookahead power and can hence be written for languages which are known to be non-context-free! Here’s a PEG for the non-context-free language $\{ a^nb^nc^n \,\mid\, n \geq 0 \}$:

s ← &(x "c") "a"+ y
x ← "a" x? "b"
y ← "b" y? "c"


Awesome, of course, but this example is not that sophisticated, and PEGs don’t make life much better for the contextual constraints in real languages. Besides, that lookahead ain’t cheap to parse!
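To see the lookahead at work, the PEG above can be transcribed almost mechanically into Python (a sketch of ours): each rule becomes a function from a position to a new position, or None on failure, and the & predicate runs a rule without consuming any input. Note that, as written, the grammar actually requires n ≥ 1, since "a"+ demands at least one a.

```python
# Sketch: direct transcription of the PEG
#   s ← &(x "c") "a"+ y,  x ← "a" x? "b",  y ← "b" y? "c"
def match_anbncn(s):
    def x(i):                       # x ← "a" x? "b"
        if i < len(s) and s[i] == "a":
            j = x(i + 1)            # x? commits if it succeeds (PEG style)
            i = j if j is not None else i + 1
            if i < len(s) and s[i] == "b":
                return i + 1
        return None

    def y(i):                       # y ← "b" y? "c"
        if i < len(s) and s[i] == "b":
            j = y(i + 1)
            i = j if j is not None else i + 1
            if i < len(s) and s[i] == "c":
                return i + 1
        return None

    j = x(0)                        # &(x "c"): look ahead, consume nothing
    if j is None or j >= len(s) or s[j] != "c":
        return False
    i = 0
    while i < len(s) and s[i] == "a":
        i += 1                      # "a"+
    return y(i) == len(s)           # y, then implicit end of input
```

The lookahead checks that the a's and b's balance and that a c follows; the body then checks that the b's and c's balance, so all three counts must agree.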

Exercise: For the Astro language we defined above, try to build a PEG that captures the rule that identifiers can only be used if previously assigned to. If you can do that, try to deal with constraints such as having certain identifiers predeclared as functions (such as sqrt, sin, cos, and hypot) and only these can be called, while other variables cannot. Would you say that PEGs are an answer to the problem of capturing contextual constraints grammatically?

### Indentation-Sensitive Syntax

In addition to the context issues that exist between a language’s syntax and its static semantics, similar issues exist between the lexical syntax and phrase syntax in languages that are indentation-sensitive. Traditionally, lexical analysis proceeds as if the size of whitespace runs doesn’t matter very much. But in languages like Python and Haskell, it does. A lot. We cannot just skip any whitespace between tokens the way we’ve done above!

The way to handle indentation sensitivity is, during lexical analysis, to keep track of indentation levels (in some sort of context object) and replace certain runs of whitespace with INDENT and DEDENT tokens.

The process isn’t too hard. For complete details, read the section on indentation in the Lexical Analysis chapter of the Python Language Reference.
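A toy version of that scheme (loosely modeled on what the Python reference describes, not the real algorithm) keeps a stack of open indentation widths and emits INDENT and DEDENT tokens when the width changes:

```python
# Sketch: turning leading whitespace into INDENT/DEDENT tokens.
def layout_tokens(lines):
    stack = [0]                 # indentation levels currently open
    tokens = []
    for line in lines:
        if not line.strip():
            continue            # blank lines don't affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()         # (a real lexer would also reject a dedent
            tokens.append("DEDENT")  # to a level that was never opened)
        tokens.append(line.strip())
    while stack[-1] > 0:        # close any blocks still open at EOF
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

After this pass, the phrase grammar can treat INDENT and DEDENT exactly like the { and } of a curly-brace language.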

## Syntax in the Real World

These notes were very much just an introduction. There are at least three directions in which you will want to continue your study:

### Alternate Syntactic Descriptions

The grammar notation we used above is not the only one that people have used. There are quite a few more in use, including:

Context-Free Grammars (Chomsky Type-2)
These are very “mathematical” and too low-level. They just use individual symbols. They have no special operators for alternation or repetition. They are too hard to read and are never used directly in practice. A CFG is a 4-tuple $(V,\Sigma,R,S)$, where $V$ is the set of variables, $\Sigma$ is the alphabet (of symbols in the language), $V \cap \Sigma = \varnothing$, $R \subseteq V \times (V \cup \Sigma)*$ is the rule set, and $S \in V$ is the start symbol. For the language of simple arithmetic expressions over natural numbers, A CFG is:

$( \\ \;\;\; \{E, T, F, N, D\}, \\ \;\;\; \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, /, (, )\}, \\ \;\;\; \{ \\ \;\;\;\;\;\;\; (D, 0), (D, 1), (D, 2), (D, 3), (D, 4), (D, 5), (D, 6), \\ \;\;\;\;\;\;\; (D, 7), (D, 8), (D, 9), (N, D), (N, ND), (E, T), (E, E+T), \\ \;\;\;\;\;\;\; (E, E-T), (T,F), (T, T*F), (T, T/F), (F,N), (F, (E)) \\ \;\;\; \}, \\ \;\;\; E \\ )$

Rather than writing them in tuple form, we often just list the rules, making sure the start symbol appears first, and making the variables and alphabet symbols inferrable by context:

E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → ( E )
F → N
N → N D
N → D
D → 0
D → 1
D → 2
D → 3
D → 4
D → 5
D → 6
D → 7
D → 8
D → 9

BNF
With BNF, we add the ability to write our variables and alphabet symbols in a more readable form. To distinguish the two, we wrap the variables in angle brackets. One special operator, | (meaning “or”), is permitted.
<expression> ::= <term>
|  <expression> + <term>
|  <expression> - <term>
<term>       ::= <factor>
|  <term> * <factor>
|  <term> / <factor>
<factor>     ::= <number>
|  ( <expression> )
<number>     ::= <digit> | <number> <digit>
<digit>      ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

EBNF
EBNF, or Extended Backus Naur Form, quotes the alphabet symbols instead of the variables, and adds operators for alternation (|), optionality (?), zero-or-more (*), and one-or-more (+).
EXP    = TERM (('+' | '-') TERM)*
TERM   = FACTOR (('*' | '/') FACTOR)*
FACTOR = NUMBER | '(' EXP ')'
NUMBER = ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')+

ABNF
ABNF, or Augmented Backus Naur Form, is the notation used in the specifications of many Internet protocols. It features tools for repetition, alternatives, order-independence, and value ranges.
exp    = term *(("+" / "-") term)
term   = factor *(("*" / "/") factor)
factor = number / "(" exp ")"
number = 1*%x30-39

Parsing Expression Grammars
PEGs are a kind of analytic grammar created by Bryan Ford. Alternatives are tried in order, and the first one that matches wins, so grammars are never ambiguous. They have powerful lookahead features allowing them to define languages that are not context-free in the sense of Chomsky.
Exp    ← Term (("+" / "-") Term)*
Term   ← Factor (("*" / "/") Factor)*
Factor ← number / "(" Exp ")"
number ← ("0".."9")+


### The Syntax of Real Programming Languages

You might like to see how much work goes into writing a grammar for a real-world programming language. Here are a few to check out. Note the wide variety of notations:

Exercise: Browse each of these descriptions. Take notes that summarize the basic ideas of each. For each, how do they distinguish lexical from phrase syntax? How do they distinguish tokens from variables? Do they feel more like CFGs, BNF, EBNF, PEGs, or are they something else entirely? What interesting conventions do they have that you had not thought of?

To be fair, programming language designers aren’t only thinking about the technical aspects of writing a grammar. They literally have to put creative effort into designing languages people like to use! There are many issues in syntax design; I have a separate page of notes on this topic if you’re interested.

### Parsing Theory and Practice

IRL, people use grammars to write correct compilers. The part of a compiler that uses a grammar to generate the parse tree from program text is called a parser. While you could write a parser by hand, most people use libraries to essentially generate a parser from a grammar! Popular libraries include:

## Summary

We’ve covered:

• Motivation for Syntax
• Difference between syntax and semantics
• How to design a syntax
• Syntax diagrams
• Grammars
• Lexical vs. Phrase rules
• Ambiguity
• Associativity
• Abstract syntax
• Tree grammars
• The problem of context
• Syntax IRL