What is syntax and how does it differ, from say, semantics? What are the different ways that we can define a syntax? What is the difference between concrete and abstract syntax? And why do programming languages tend to use context-free grammars for their syntax?


A programming language gives us a way structure our thoughts. Each program, then, has a kind of internal structure, for example:


We generally capture this structure as a sequence of symbols; for example, the structure above might appear as:

dozen = 1 * (0 + sqrt(101.3));
y = dozen - 0;    // TADA 🥑
dozen = 0 / y;
print(cos(dozen)) ;//

or even:


Of course, we don’t have to use strings! Visual languages do exist:


If you were a language designer, you’d deal with questions such as:


The syntax of a programming language is the set of rules that define which arrangements of symbols comprise structurally legal programs—and for each structurally legal program, what its structure actually is.

Syntactic rules should be able to tell us that these sentences are not well-formed in English:

but that these are:

and so is this famous one, due to Chomsky:

Designing the Syntax of a Language

Let’s learn about syntax by going through the process of designing a syntax for a small language. A designer will often start by writing down examples of programs in the language. Let’s design a small language which we’ll call Astro. Programs will look like this:

// A simple program in Astro

radius = 55.2 * (-cos(2.8E-20) + 89) % 21;    // assignment statement
the_area = π * radius ** 2;                   // a built in identifier
print(the_area, hypot(2.28, 3 - radius) / 5); // call statement

So programs are structures as a sequence of one or more statements, each of which are assignments or calls, with expressions formed with numbers, variables, calls, and the usual arithmetic operators, which can have parentheses when needed. Comments look like the slash-slash comments of C-like languages. Let’s get a little more precise by turning these thoughts into pictures.

A program is a sequence of one or more statements:


There will be two kinds of statements: assignments and calls, each ending with a semicolon:


Assignments and calls look like this:


An identifier is the computer science word for a name of something you define in your code, like a variable, constant, function, type, parameter, or similar thing. A keyword is a special word in your language, like let or if or while—these words generally do not name things they are used to structure the code. If a keyword is not allowed to be an identifier, we call it a reserved word. Astro does not have any keywords, but most languages do.

Identifiers in Astro begin with a letter, and can have letters, digits, and underscores (examples: x, last_attempt, p1, p2, overTheLimit). We’ll assume letter and digit are built-in, but we must define identifiers:


A statement performs an action (with a side-effect), but expressions represent the computation of a value. Expressions can be numbers, variables, function calls, or can be made up of operators applied to other expressions. We can put parentheses around any expression. This will do the trick (but spoiler alert: we will have to change it later):

Exercise: Explain the difference between expressions and statements to a friend. Make yourself a sacred vow never to mix the terms up. Precise language is a good thing. Sometimes we slip up, but still, let’s try not to say things like “plus statement.”

We mentioned numerals here, so let’s define these. Super simple for this language:


Wait why do we have rectangles vs. ovals?

Did you notice that there are two kinds of nodes in our syntax diagrams? Ovals are for nodes that internally cannot have spaces; rectangles are used for constructs whose components can have spaces between them. By spaces, we mean whitespace characters (space characters, tab characters, newlines and such), or comments. The special category called space defines how our syntactic components can be separated.


We’ve chosen the space character, the tab character, the carriage return character, the linefeed character (the linefeed (U+000A) is pretty much universally accepted as the “newline”), and comments, which start with // and go to the end of the current line.

The diagrams are a great way to carry out the syntactic design, but they are a little vague in places (notice the “any non-newline char” piece), and it’s really, really hard to process diagrams by machine. They don’t scale up to massive languages, either: a diagrammatic presentation of a language can get “spread out” and hard to see all at once. Text is much more dense then pictures, and more machine-friendly. So now it’s time to write a precise, formal grammar for our language.


Let’s covert our diagrams into actual grammars. This is actually quite easy to do (once you learn all the notation):

-- Grammar for the programming language Astro

Program    → Statement+
Statement  → Assignment ";"
           | Call ";"
Assignment → id "=" Exp
Call       → id "(" (Exp ("," Exp)*)? ")"
Exp        → num
           | id 
           | Call
           | "-" Exp
           | Exp ("+" | "-" | "*" | "/" | "%" | "**") Exp
           | "(" Exp ")"
num        → digit+ ("." digit+)? (("E" | "e") ("+" | "-")? digit+)?
id         → letter (letter | digit | "_")*
space      → " " | "\t" | "\r" | "\n" | "//" ("\x00".."\x09" | "\x0b".."\x10ffff")*

This notation is pretty compact! Here’s a brief explainer:

Lexical vs. Phrase Rules

The lexical rules, with the exception of the special rule for space, show us how to combine characters into words and punctuation which will call tokens. The phrase rules show us how to combine words and symbols into phrases (where the words and symbols can, if you wish, be surrounded by spacing).

Exercise: Meditate on how profound this is. Have you seen anything like this before? Hint: Are you reading this page in English?

This means it is best to think about the syntax as actually having two distinct parts. The first part (the lexical rules) show us how we can tokenize a stream of characters into a stream of tokens, and the second part (the phrase rules) show us how to parse the token stream into the parse tree. Here’s an example. The source string:

  print( // ⿇🌿
420 );

is made up of these characters:


When we apply the lexical rules to this character stream, skipping the spaces (and comments), we get the token stream:

id(print) ( num(420) ) ;

Next, we apply the phrase rules allows us to uncover the parse tree:


A very important thing to note: The frontier of the parse tree is the token stream.

Exercise: Repeat this phrase to yourself five times: The frontier of the parse tree is the token stream.
The parse tree ends at tokens, not characters

Please take the time to consider how silly it would be if the parse tree expanded all of the lexical rules down to characters. Having the parse tree stop at tokens is a good example of what we would call “breaking a complex problem down into simpler parts.”

Dealing With Ambiguity

How did we do with our grammar? Well, you know, there is one issue: It can parse the string 9-3*7 in two ways:


Having more than one parse tree for a given input string means that our syntax description is ambiguous. This isn’t necessarily bad, because we could (if we wanted to) leave the ambiguity in the grammar but supply extra informal explanatory text in our language definition that determines how the ambiguity is to be resolved. 🙄🙄🙄 Okay, if you don’t like that idea, we can get rid of this particular ambiguity by adding extra rules in the grammar itself, that basically force an operator precedence:

Exp        → Term (("+" | "-") Term)*
Term       → Factor (("*" | "/" | "%") Factor)*
Factor     → Primary ("**" Primary)*
Primary    → num
           | id
           | Call
           | "-" Primary
           | "(" Exp ")"

Great! Now there is one and only one parse tree for that previously problematic expression:


Exercise: Argue why, or prove if you can, that, the new grammar is not ambiguous.

If you are still getting used to the grammar notation, let’s see the new rules as syntax diagrams:


Our grammar rules have forced the binary operators into a precedence hierarchy!


Wait, we are not done with structuring our operators just yet. The way things stand now, the parse tree for 3-8-5 looks pretty flat:


It doesn’t suggest whether we mean to compute (3-8)-5 (= -10) or 3-(8-5) (= 0). As language designers, we should make the choice of left-associativity or right-associativity or even non-associativity for each operator. There is a way to do this. Let’s make the additive and multiplicative operators left associative and the exponentiation operator right associative:

Exp        → (Exp ("+" | "-"))? Term
Term       → (Term ("*" | "/" | "%"))? Factor
Factor     → Primary ("**" Factor)?

Or, equivalently, in diagrams:


How the heck does this work? Study these parse trees, and hopefully the insight will come to you! (Hint: remember the grammar is designed to force the tree to “come out” a certain way.


Exercise: Study this until you understand it well.

For reference, here’s the final grammar:

Program    → Statement+
Statement  → Assignment ";"
           | Call ";"
Assignment → id "=" Exp
Call       → id "(" (Exp ("," Exp)*)? ")"
Exp        → (Exp ("+" | "-"))? Term
Term       → (Term ("*" | "/" | "%"))? Factor
Factor     → Primary ("**" Factor)?
Primary    → num
           | id
           | Call
           | "-" Primary
           | "(" Exp ")"
num        → digit+ ("." digit+)?
id         → letter (letter | digit | "_")*
space      → " " | "\t" | "\r" | "\n" | "//" ("\x00".."\x09" | "\x0b".."\x10ffff")*

Abstract Syntax

We have to do a lot of specification work to define programs as strings of characters. For example, we needed extra tokens (parentheses) and extra categories (Term, Factor, and Primary) in order to capture precedence. We have semicolons to terminate statements. Real languages have lots and lots of punctuation as well. The thing is, these parse trees can get huge!

But we can simplify things a bit. If we just need to know what a program is supposed to do, we don’t really need that much from our parse tree. We can abstract away a lot of the messy stuff, and translate our parse tree into an abstract syntax trees:


Crazy, right? Abstract syntax trees are much simpler. They’re more abstract. And by the way, since they exist, you will sometimes here parse trees called concrete syntax trees.

So how do we come up with an abstract syntax? Focus on what a program is trying to express, without worrying about punctuation. You don’t even have to worry about precedence and associativity (or even parentheses) because the abstract syntax is tree-structured, and not a string. So cool! Here’s how we think about it for Astro:

If you want to formalize this, you can write a tree grammar:

$ \begin{array}{lcl} n: \mathsf{Nml} & & \\ i: \mathsf{Ide} & & \\ e: \mathsf{Exp} & = & n \mid i \mid -e \mid e+e \mid e-e \mid e\,\mathtt{*}\,e \mid e\,/\,e \mid e\,\mathtt{\%}\,e \mid e\,\mathtt{**}\,e \mid i\;e*\\ s: \mathsf{Stm} & = & i = e \mid i\;e*\\ p: \mathsf{Pro} & = & s+\\ \end{array}$

When translating to an abstract syntax, we not only get rid of intermediate syntax categories, but we also drop parentheses, commas, semicolons, brackets, and pretty much all punctuation. Precedence and associativity disappear because they are handled naturally within the AST structure. Some of the smaller keywords that you see in larger programming languages (in, to, of) also are not needed. Punctuation really only exists to give a “tree structure” to our linear program text. Once we uncover the tree structure, it is no longer needed!

Exercise: Give the parse tree and the abstract syntax tree for print(3-5-8-2*8-9-7**3**1); according to this grammar and abstract syntax.
What’s the point of the Abstract Syntax?

Sure, it’s massively simpler, but what do we use it for? We use it in language design to specify the semantics (meaning) of programs in the language. We use it in language implementation to know exactly how to write the compilers and interpreters and other language processors. It’s a lot easier to run interpreters, optimizers, static analyzers, code generators, and other similar tools off the ASTs than off the CSTs. A lot easier.

Multiple Concrete Syntax Trees

Guess what else? ASTs is that they allow us to focus on programming language capabilities without worrying about syntactic details that people can reasonably have different preferences about. Take a look at this AST:


Stop! Don’t scroll by! Study this tree! What is this program “saying”?

Sure, it appears to a program in a language whose tree grammar probably looks like this:

$ \begin{array}{lcl} n: \mathsf{Nml} & & \\ i: \mathsf{Ide} & & \\ t: \mathsf{Type} & = & \mathtt{int} \mid \mathtt{bool} \\ e: \mathsf{Exp} & = & n \mid i \mid -e \mid e+e \mid e-e \mid e\,\mathtt{*}\,e \mid e\,/\,e \mid e\,\mathtt{**}\,e \\ & & \mid\; e < e \mid e \leq e \mid e\mathtt\,{==}\,e \mid e \neq e \mid e > e \mid e \geq e \mid i\;e*\\ s: \mathsf{Stmt} & = & i = e \mid \mathtt{read}\;i* \mid \mathtt{write}\;e* \mid \mathtt{while}\;e\;b \\ d: \mathsf{Dec} & = & \mathtt{declare}\;(i\;t)*\\ b: \mathsf{Block} & = & {d*}\;s+\\ p: \mathsf{Program} & = & b\\ \end{array}$

But what does the actual concrete syntax look like? There’s no specific answer. There are an infinite number of possibilities! Here is just a small sampling of programs that could have generated that AST:

    var x, y: int
    // In this language, indentation defines blocks
    while y - 5 == 3:
        var y: int
        x = 2 * (3 + y)

int x, y;
while y - 5 = 3 {
    int y;
    STDIN -> x;
    STDIN -> y;
    x <- 2 * (3 + y);
STDOUT <- 5;


  (declare x int)
  (declare y int)
  (while (= (- y 5) 3)
    (seq (define (y int)) (read x y) (assign x (* 2 (+ 3 y))))
  (write 5)

  <declare var="x" type="int"/>
  <declare var="y" type="int"/>
      <minus><varexp ref="y"/><intlit value="5"/></minus>
      <intlit value="3"/>
    <declare var="y" type="int"/>
    <read vars="x y"/>
    <assign varexp="x">
        <intlit value="2"/>
          <intlit value="3"/>
          <varexp ref="y"/>
    <intlit value="5"/>
Exercise: Write the example in a JSON-like notation.
Exercise: The fourth example above has an “S-expression syntax” which is seen in languages like Lisp and Clojure. In such languages, concrete and abstract syntaxes can be said to be very close to each other. Why is this claim true?

More AST Examples

Getting a good feel for abstract syntax naturally takes practice. Here are a couple examples to study to get a feel for how abstract we need to get. Here’s a fragment of Java:

if (((f(--x)))) {
    System.out.println(- 3*q);
} else if (!here||there    &&everywhere) {
    while (false) {
        x += x >> 2 | 3  -++  x;
    album[5].track("name") = title.scalars[2];


And here’s a little JavaScript program:

export class C {
  q() { const [p, q] = a.c[1]}
  r({x}, ...z) { while (c.c&&p|- -x) new C() }


The Problem of Context

Did you notice that the grammars we saw above left some things out? For example, none of our grammars could capture a simple rule like:

“You can only use a variable if it has been previously declared.”

Real programming languages have this issue, too. For example the official grammar for Java will not derive this program:

class A {2 / == {{;

so a compiler would report a syntax error. But this program:

class A {int x = y;}

is structurally well formed according to the official syntax of the Java language! The compilation unit consists of a type declaration that is a class declaration whose body consists of a field declaration with a type, a name, and an initializing expression which is an identifier. But we know this program is not legal, since the identifier y has not been declared. What’s going on? The same goes for this C program, which has multiple problems despite being “structurally well-formed” according to the C grammar:

 * This program is derivable by the official grammar for C.
int main() {
    f(x)->p = sqrt(3.3);
    return "dog";

So why do these language grammars allow this nonsense? The short answer is, it’s really really really really really really really really hard and really really really really really really really really ugly to define contextual rules grammatically! Capturing the “all identifiers must be declared before use” constraint would require that a grammar rule for expressions not just generate any identifier, but only certain identifiers known from some outside context. Context! In general, capturing context in a grammar leads to grammars that:

Try the following exercise if you’re not convinced:

Exercise: Write grammars for:
  • $\{ a^nb^nc^n \;\mid\; n \geq 0 \}$
  • $\{ a^{2^n} \;\mid\; n \geq 0 \}$
  • $\{ ww \;\mid\; w \in \{a,b\}* \}$
Hint 1: Grammar rules these languages will have to have at least some rules with multiple symbols on the left hand side. That is because there exist proofs (not shown here) that these three languages are not context-free.

Hint 2: Having trouble? Answers can be found on my notes on Language Theory.
Exercise: Do some research to see if there are any known lower complexity bounds on the parsing of context-sensitive grammars.

Defining Contextual Constraints

If you think about it, so many constraints require context. Here’s a small sampling from various languages:

Since context sensitive grammars are too hard to write, too hard to understand, too hard to parse, what do we do? There are at least two options:

Which take do you prefer? Slonneger and Kurtz take the former approach, writing:

The classification of such an error entails some controversy. Language implementers, such as compiler writers, say that such an infraction belongs to the static semantics of a language since it involves the meaning of symbols and can be determined statically, which means solely derived from the text of the program. We argue that static errors belong to the syntax, not the semantics, of a language. Consider a program in which we declare a constant:
    const c = 5;
In the context of this declaration, the following assignment commands are erroneous for essentially the same reason: It makes no sense to assign an expression value to a constant.
    5 := 66;
    c := 66;
The error in the first command can be determined based on the context-free grammar (BNF) of the language, but the second is normally recognized as part of checking the context constraints. Our view is that both errors involve the incorrect formation of symbols in a command—that is, the syntax of the language. The basis of these syntactic restrictions is to avoid commands that are meaningless given the usual model of the language.

Though it may be difficult to draw the line accurately between syntax and semantics, we hold that issues normally dealt with from the static text should be called syntax, and those that involve a program’s behavior during execution be called semantics. Therefore we consider syntax to have two components: the context-fr ee syntax defined by a BNF specification and the context-sensitive syntax consisting of context conditions or constraints that legal programs must obey.

What about PEGs?

PEGs have lookahead power and can hence be written for languages which are known to be non-context-free! Here’s a PEG for the non-context-free language $\{ a^nb^nc^n \,\mid\, n \geq 0 \}$:

s ← &(x "c") "a"+ y
x ← "a" x? "b"
y ← "b" y? "c"

Awesome of course, but this example is not that sophisticated, and PEGs don’t make life that much better at all for the contextual constraints in real languages. Besides, that lookahead ain’t cheap to parse!

Exercise: For the Astro language we defined above, try to build a PEG that captures the rule that identifiers can only be used if previously assigned to. If you can do that, try to deal with constraints such as having certain identifiers predeclared as functions (such as sqrt, sin, cos, and hypot) and only these can be called, while other variables cannot. Would you say that PEGs are an answer to the problem of capturing contextual constraints grammatically?

Indentation-Sensitive Syntax

In addition to the context issues that exist between a language’s syntax and its static semantics, similar issues exist between the lexical syntax and phrase syntax in languages that are indentation-sensitive. Traditionally, lexical analysis proceeds as if the size of whitespace runs don’t matter very much. But in languages like Python and Haskell, they do. A lot. We cannot just skip any whitespace between tokens the way we’ve done so above!

The way to handle indentation sensitivity is, during lexical analysis, keep track of indentation levels (in some sort of context object), and replace certain runs of whitespace with INDENT and DEDENT tokens.

The process isn’t too hard. For complete details, read the section on indentation in the Lexical Analysis chapter of the Python Language Reference.

Syntax in the Real World

These notes were very much just an introduction. There are at least three directions in which you will want to continue your study:

Alternate Syntactic Descriptions

The grammar notation we used above is not the only one that people have used. There are quite a few more in use, including:

Context-Free Grammars (Chomsky Type-2)
These are very “mathematical” and too low-level. They just use individual symbols. They have no special operators for alternation or repetition. They are too hard to read and are never used directly in practice. A CFG is a 4-tuple $(V,\Sigma,R,S)$, where $V$ is the set of variables, $\Sigma$ is the alphabet (of symbols in the language), $V \cap \Sigma = \varnothing$, $R \subseteq V \times (V \cup \Sigma)*$ is the rule set, and $S \in V$ is the start symbol. For the language of simple arithmetic expressions over natural numbers, A CFG is:

$( \\ \;\;\; \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, /, (, )\}, \\ \;\;\; \{E, T, F, N, D\}, \\ \;\;\; \{ \\ \;\;\;\;\;\;\; (D, 0), (D, 1), (D, 2), (D, 3), (D, 4), (D, 5), (D, 6), \\ \;\;\;\;\;\;\; (D, 7), (D, 8), (D, 9), (N, D), (N, ND), (E, T), (E, E+T), \\ \;\;\;\;\;\;\; (E, E-T), (T,F), (T, T*F), (T, T/F), (F,N), (F, (E)) \\ \;\;\; \}, \\ \;\;\; E \\ )$

Rather than writing them in tuple form, we often just list the rules, making sure the start symbol appears first, and making the variables and alphabet symbols inferrable by context:

E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → ( E )
F → N
N → N D
N → D
D → 0
D → 1
D → 2
D → 3
D → 4
D → 5
D → 6
D → 7
D → 8
D → 9
With BNF, we add the ability to have our variables and alphabet symbols be written in a more readable form. To distinguish the two, we quote the variables with angle brackets. One special operator, | (meaning “or”), is permitted.
<expression> ::= <term>
              |  <expression> + <term>
              |  <expression> - <term>
<term>       ::= <factor>
              |  <term> * <factor>
              |  <term> / <factor>
<factor>     ::= <number>
              |  ( <expression> )
<number>     ::= <digit> | <number> <digit>
<digit>      ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
EBNF, or Extended Backus Naur Form, quotes the alphabet symbols instead of the variables, and adds operators for alternation (|), optionality (?), zero-or-more (*), and one-or-more (+).
EXP    = TERM (('+' | '-') TERM)*
TERM   = FACTOR (('*' | '/') FACTOR)*
NUMBER = ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')+
ABNF, or Augmented Backus Naur Form, is the specification used to describe all the Internet Protocols. It features tools for repetition, alternatives, order-independence, and value ranges.
exp    = term *(("+" / "-") term)
term   = factor *(("*" / "/") factor)
factor = number / "(" exp ")"
number = %x30-39
Parsing Expression Grammars
PEGs are a kind of analytic grammar created by Bryan Ford. There is no backtracking through alternation, so grammars are never ambiguous. They have powerful lookahead features allowing them to define languages that are not context-free in the sense of Chomsky.
Exp    ← Term (("+" / "-") Term)*
Term   ← Factor (("*" / "/") Factor)*
Factor ← number / "(" Exp ")"
number ← ("0".."9")+

The Syntax of Real Programming Languages

You might like to see how much work goes into writing a grammar for a real-world programming language. Here are a few to check out. Note the wide variety of notations:

Exercise: Browse each of these descriptions. Take notes that summarize the basic ideas of each. For each, how to they distinguish lexical from phrase syntax? How to they distinguish tokens from variables? Do they feel more like CFGs, BNF, ENBF, PEGs, or are they something else entirely? What interesting conventions do they have that you had not thought of?

To be fair, programming language designers aren’t only thinking about the technical aspects of writing a grammar. They literally have to put creative effort into designing languages people like to use! There are many issues in syntax design; I have a separate page of notes on this topic if you’re interested.

Parsing Theory and Practice

Irl, people use grammars to write correct compilers. The part of a compiler that uses a grammar to geenrate the parse tree from program text is called a parser. While you could write a parser from hand, most people use language libraries to essentially generate a parser from a grammar! Popular libraries include:


We’ve covered:

  • Motivation for Syntax
  • Difference between syntax and semantics
  • How to design a syntax
  • Syntax diagrams
  • Grammars
  • Lexical vs. Phrase rules
  • Ambiguity
  • Associativity
  • Abstract syntax
  • Tree grammars
  • The problem of context
  • Syntax IRL