The first step in designing a language is to answer some basic questions. (The answers might change as the language design evolves, but you must make best-effort answers before embarking on your language-design journey.) Rather than just telling you what those questions are, let’s actually go through the question and answer process for a simple language.
let x = 3 * (7 + sqrt 2) // Declarations introduce variables let y = abs x - 7 print y * x / 2 print 55.283 x = 1 // Wow, an assignment! print x
sqrt
and abs
functions. Numbers given in base-ten and are either integers or can have a single decimal point (but no exponent part). Identifiers are letters and digits starting with a letter (valid examples: $x$, $y$, $abc1xqWV$). Whitespace can be used liberally around (but not within) symbols. Comments start with //
and extend to the end of the line.It is one thing to talk about how your language looks and what programs in your language mean, but it is next-level to specify the structure and the meaning in precise, formal, mathematical, unambiguous, and stunningly beautiful notation. It’s near impossible to capture all the fine details in prose.
Note that we said structure and meaning. You need to define both. The formal terms for these aspects are syntax and semantics. We generally start with the syntax.
Let’s think through, and sketch out, the structure of programs in our language. We want to specify exactly which sequences of characters comprise well-structured, or well-formed, programs. Such a specification is called the concrete syntax of the language. First, a program will be a sequence of statements:
There will be three kinds of statements: declarations (to bring new variables to life), assignments (to update the value of an existing variable), and print statements:
Identifiers
An identifier is the computer science word for a name of something you define in your code, like a variable, constant, function, type, parameter, or similar thing. A keyword is a special word in your language, likelet
orif
orwhile
—these words generally do not name things, but sometimes, only in rare cases in some languages, they can. If a keyword is not allowed to be an identifier, we call it a reserved word.
Identifiers vs. Variables
Note how in the declaration rule we usedid
, but in the assignment rule we usedVar
. That’s because when declaring we are introducing a new identifier, but when assigning, we want to use an existing variable.
While a statement performs an action (with a side-effect), expressions represent the computation of a value. Expressions can be basic numbers or variables, or they can be made up of operators applied to other expressions. We can put parentheses around any expression, too:
In larger languages, variables can look complicated like players[0].name
, but Ael is so simple that variables cannot be anything more than just a single identifer:
We should specify the low-level details of our numeric literals and identifiers, too. We could get super-detailed with these, but for now let’s be super simple:
The diagrams are a great start, but they are a little vague in places (that “not a keyword” for example), and it’s really, really hard to process diagrams by machine. Text is much more dense then pictures, and more machine-friendly. So now it’s time to write a precise, formal grammar for our language. Jumping right in:
FIRST ATTEMPT, NOT FINAL
Ael {
Program = Statement+
Statement = let id "=" Exp --declare
| Var "=" Exp --assign
| print Exp --print
Exp = Exp ("+" | "-" | "*"| "/") Exp --binary
| ("-" | abs | sqrt) Exp --unary
| Var
| num
| "(" Exp ")" --parens
Var = id
num = digit+ ("." digit+)?
let = "let" ~alnum
print = "print" ~alnum
abs = "abs" ~alnum
sqrt = "sqrt" ~alnum
keyword = let | print | abs | sqrt
id = ~keyword letter alnum*
space += "//" (~"\n" any)* ("\n" | end) --comment
}
What is this cool notation? It’s Ohm. We’ll cover all of the details later, but study the following to begin:
space*
between all of the constructs on their right hand side.letter
(matches any unicode letter), digit
(matches any Basic Latin digit "0".."9"), alnum
(matches any letter or digit), end
(matches when you are at the end of the input), space
(matches any Unicode whitespace character), and any
(matches any character). The start rule is assumed to have an end
at the end, even if not explicitly shown.\n
), tab (\t
), backslash (\\
) and friends, as has become conventional in computer science. Also as per convention: *
means “zero or more”, +
means “one or more”, ?
means zero-or-one (i.e., “optional”), and |
means “or”.
+=
is used to add a new alternative to an existing rule (i.e., a+=b
abbreviates a=a|b
.) We’ve used it above to add comments to the language. Anything in the space
rule is skipped within phrase rules, so this is a perfect place to add comments!--
annotations are not important for the describing the syntax of the language, but are necessary when using the Ohm library to process the grammar. Details for when they are required can be found in the Ohm Documentation. You can ignore them until you starting using the Ohm tools.~
in the grammar means that letx = 2
parses as an assignment and not as the declaration let x = 2
.
~
in the grammar means that let = 5
is illegal. In other words, how, exactly, are keywords prohibited from being identifiers?
The lexical rules, with the exception of the special rule for space
, show us how to combine characters into words and punctuation which will call tokens. The phrase rules show us how to combine words and symbols into phrases (where the words and symbols can, if you wish, be surrounded by spacing).
This means it is best to think about the concrete syntax as actually having two distinct parts. The first part (the lexical rules) show us how we can tokenize a stream of characters into a stream of tokens, and the second part (the phrase rules) show us how to parse the token stream into the concrete syntax tree. Here’s an example. The source string:
print // ⿇🌿 420
is made up of these characters:
SPACE SPACE LATIN SMALL LETTER P LATIN SMALL LETTER R LATIN SMALL LETTER I LATIN SMALL LETTER N LATIN SMALL LETTER T TAB SOLIDUS SOLIDUS SPACE KANGXI RADICAL HEMP HERB LINEFEED DIGIT FOUR DIGIT TWO DIGIT ZERO
When we apply the lexical rules to this character stream, skipping the spaces (and comments), we get the token stream:
print
num(420)
So much simpler, right! Next, applying the phrase rules allows us to uncover the concrete syntax tree (also known as the parse tree):
A very important thing to note: The frontier of the parse tree is the token stream.
The parse tree ends at tokens, not characters
Please take the time to consider how silly it would be if the parse tree expanded all of the lexical rules down to characters. Having the parse tree stop at tokens is a good example of what we would call “breaking a complex problem down into simpler parts.”
How did we do with our grammar? Well, you know, there is one issue: It can parse the string 9-3*7
in two ways:
Having more than one parse tree for a given input string means that our syntax description is ambiguous. This isn’t necessarily bad, because we could (if we wanted to) leave the ambiguity in the grammar but supply extra informal explanatory text in our language definition that determines how the ambiguity is to be resolved. 🙄🙄🙄 Okay, if you don’t like that idea, we can get rid of this particular ambiguity by adding extra rules in the grammar itself that make an operator precedence emerge:
SECOND ATTEMPT, STILL NOT FINAL Ael { Program = Statement+ Statement = let id "=" Exp --declare | Var "=" Exp --assign | print Exp --print Exp = Term (("+" | "-") Term)* Term = Factor (("*"| "/") Factor)* Factor = Var | num | ("-" | abs | sqrt) Factor --unary | "(" Exp ")" --parens Var = id num = digit+ ("." digit+)? let = "let" ~alnum print = "print" ~alnum abs = "abs" ~alnum sqrt = "sqrt" ~alnum keyword = let | print | abs | sqrt id = ~keyword letter alnum* space += "//" (~"\n" any)* ("\n" | end) --comment }
Great! Now there is one and only one parse tree for that previously problematic expression:
Do you see how we made precedence happen? Here’s what we are saying:
But we might want to do one more thing. The way things stand now, the parse tree for 3-8-5
looks pretty flat:
It doesn’t suggest whether we mean to compute (3-8)-5
or 3-(8-5)
. As language designers, we should make the choice of left-associativity or right-associativity for each operator. (Oh, operators can also be non-associative — more on that later.) Let’s make all of our binary operators left-associative:
Ael {
Program = Statement+
Statement = let id "=" Exp --declare
| Var "=" Exp --assign
| print Exp --print
Exp = Exp ("+" | "-") Term --binary
| Term
Term = Term ("*"| "/") Factor --binary
| Factor
Factor = Var
| num
| ("-" | abs | sqrt) Factor --unary
| "(" Exp ")" --parens
Var = id
num = digit+ ("." digit+)?
let = "let" ~alnum
print = "print" ~alnum
abs = "abs" ~alnum
sqrt = "sqrt" ~alnum
keyword = let | print | abs | sqrt
id = ~keyword letter alnum*
space += "//" (~"\n" any)* ("\n" | end) --comment
}
And we get:
If you are using Ohm to specify your grammar, you can use the Ohm Editor to experiment with and test your grammar as you write it! We’ll do an in-class demo of it.
Get familiar with this tool. Add lots of test cases, both positive and negative.
We had to do a lot of specification work to define programs as strings of characters. For example, we needed extra tokens (parentheses) and extra categories (Term and Factor) in order to capture precedence, and use other tricks to capture associativity. In order to simplify the remainder of our specification (we haven’t done semantics yet) we need to abstract away this messy stuff. We want to transform concrete syntax trees into simpler (but no less precise!!) abstract syntax trees:
Can we formalize this transformation mathematically? Of course we can! Let’s define function $\mathscr{A}$ to turn a parse tree (CST) into the corresponding AST. We write trees in the form $(A\;B\;C\;...)$ where $A$ is the root and $B$, $C$, ... are the children.
When translating to an abstract syntax, we not only get rid of intermediate syntax categories, but we also drop parentheses, commas, semicolons, brackets, and pretty much all punctuation. Punctuation really only exists to give a “tree structure” to our linear program text.
3-5-8-2*8-9-7-1
according to this grammar and abstract syntax.
Note that our syntax did not capture the requirement that “you cannot use an identifier in a statement unless it was previously assigned.” You might want to try to build a grammar to do that, but you will find it notoriously difficult: it is a context-sensitive requirement! Context sensitivity is so hard to capture in a grammar that we specify separate semantic rules to enforce them. Semantic rules that are intended to be checked at compile-time, before the program runs, are called static semantic rules.
In our formal language definition, we capture these rules in a Static Semantic Definition: a function that says, for a given program, whether or not it is semantically correct. Basically we want to think about it like this:
So all of our statements and expressions have to be analyzed for correctness within a context that holds which identifiers have been previously declared. The analysis of a declaration statement will update the context (if necessary) so all future statements will have it available. Belive it or not, you can capture this requirements formally by defining a mathematical function that says whether or not an AST is semantically okay:
Syntax vs. Static Semantics
The division between what is syntax (what’s in the grammar) and what is static semantics (checked by custom functions outside the grammar) is sometimes a matter of choice. There are some things that are not too bad to put in the grammar, but they might just be ugly. In this case, feel free to use the semantics.
A dynamic semantics computes the meaning of an abstract syntax tree. In Ael:
We can formally define functions that map abstract syntax trees to their meanings:
Things seem to have gotten complicated! So let’s go through a complete example, from character string to program meaning, step-by-step, explaining everything along the way.
Let’s determine the meaning of this string. We’ll walk through everything we did so far:
let yo=8 print -yo * (22- 7)// 🎉
The character string is:
LATIN SMALL LETTER L LATIN SMALL LETTER E LATIN SMALL LETTER T SPACE LATIN SMALL LETTER Y LATIN SMALL LETTER O EQUAL SIGN DIGIT EIGHT SPACE LATIN SMALL LETTER P LATIN SMALL LETTER R LATIN SMALL LETTER I LATIN SMALL LETTER N LATIN SMALL LETTER T LINEFEED SPACE SPACE HYPHEN LATIN SMALL LETTER Y LATIN SMALL LETTER O SPACE TAB ASTERISK SPACE LEFTPAREN DIGIT TWO DIGIT TWO HYPHEN SPACE SPACE DIGIT SEVEN RIGHTPAREN SOLIDUS SOLIDUS SPACE PARTY POPPER
We first apply the lexical rules to tokenize the string, remembering to skip the spaces (and comments):
let
id(yo)
=
num(8)
print
-
id(yo)
*
(
numlit(22)
-
numlit(7)
)
Applying the phrase rules uncovers the concrete syntax tree:
The static semantics is applied to the concrete syntax tree to produce the abstract syntax tree: $$ \begin{array}{l} ( \mathsf{Program} \\ \;\;\;\; ( \mathsf{VarDec} \; \mathrm{yo} \; 8 ) \\ \;\;\;\; ( \mathsf{Print} \\ \;\;\;\;\;\;\;\; ( \mathsf{Times} \\ \;\;\;\;\;\;\;\;\;\;\;\; ( \mathsf{Negate} \; ( \mathsf{IdExp}\;\mathrm{yo})) \\ \;\;\;\;\;\;\;\;\;\;\;\; ( \mathsf{Minus} \; 22 \; 7) ))) \end{array} $$
which we can read more easily in tree form:
which, informally can be shortened to:
(The shortened form plays a little loose with the node names, and it elides out the IdExp
nodes, because those are, you know, pretty obvious, right?)
Finally, we apply the dynamic semantics to the AST to compute the meaning:
𝒫(Program (VarDec yo 8) (Print (Times (Negate (IdExp yo)) (Minus 22 7)))) = = (𝒮(Print (Times (Negate (IdExp yo)) (Minus 22 7)))(𝒮(VarDec yo 8)({},[]))) [1] = (𝒮(Print (Times (Negate (IdExp yo)) (Minus 22 7)))(𝒮(VarDec yo 8)({},[]))) [1] = ({yo:8}, [ℰ(Times (Negate (IdExp yo)) (Minus 22 7)){yo:8}]) [1] = [ℰ(Times (Negate (IdExp yo)) (Minus 22 7)){yo:8}] = [ℰ(Negate (IdExp yo)){yo:8} × ℰ(Minus 22 7){yo:8}] = [ℰ(Negate 8){yo:8} × (ℰ(22){yo:8} − ℰ(7){yo:8})] = [-ℰ(8){yo:8} × (22 − 7)] = [-8 × (22 − 7)] = [-120]
So the sequence of values the program prints consists of just one value, -120.
DON’T PANIC
You’ve just been introduced to the wonderful field of denotational semantics, and you’ve seen that programming languages can be given a formal, mathematical definition. That does not mean, however, that to write a compiler you have to deeply understand, let alone even use, this stuff! Still, it’s good to be aware of.
Again it’s ok: Real compilers and interpreters do their translations and evaluations in code, not semantic equations. That’s what you’ll do too. Hang in there!
98 * (247 - 6) / 3 / 100 - 701
.
(* (- 8) (- 22 7))
is the actual way to write arithmetic expressions. Hint: several do exist. Why would anyone design a language this way?
We’ve scratched the surface of what you need to know to formally specify a language. It’s fun, there there’s a lot more to do. Here’s what is coming:
We’ve covered: