It’s really hard to write a grammar that can determine all of the legality rules for a language. So in practice, we don’t put too much effort into the grammar. We let the grammar take care only of context-free rules that can reject, for example:
clask C {int y/-&&^while var xyz += ((((((((x;}
but leave rules like identifier uniqueness, type checking, parameter matching—and anything else that requires examining context—out of the grammar. So we say something like this:
class C {int y = x;}
is correct according to the grammar—some might even say it is syntactically correct. But the program still should not be allowed to run, as there is an error that can be detected by looking at the source code. Because the error is detectable before the program is executed, this is a static error, and finding these errors is part of the activity known as static analysis. Whether you call these kinds of errors “static semantic errors” or “context-sensitive syntax errors” is really up to you.
In order to enforce the contextual constraints, it is necessary to decorate the parse tree or AST with contextual information.
Static analysis applies to some languages more than others. “Dynamic languages” do almost no analysis at compile time. Some languages are super static, permitting almost every check to be done before execution. These notes try to consider tons of possible semantic checks, whether or not most languages implement them.
The set of static constraints varies greatly from language to language, so rather than focusing on a particular language, we’ll just lay out a few constraints that you might see somewhere:
- `break` and `continue` statements may only appear in a loop.
- `return` statements may only appear in a function. (It’s possible to encode these restrictions in the grammar, but it would be ugly.)
- In `p.x`, $p$ must have a dictionary type and $x$ must be a field of the type of $p$. Or $p$ is a module, package, or namespace, and $x$ is an identifier marked as exportable from it.
- In `p[x]`, $p$ must have an array type and $x$ must have a type compatible with the index type of $p$’s type.
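Checks like these are easy to express as a context-passing tree walk. Here is a minimal sketch (not from any particular compiler; the node shapes are invented for illustration) that enforces the break-only-in-loop rule:

```javascript
// Minimal sketch: enforce "break only inside a loop" with a context-passing
// tree walk. Node shapes ({kind, body, statements}) are invented here.
function check(node, inLoop = false) {
  switch (node.kind) {
    case "Break":
      if (!inLoop) throw new Error("break outside of loop");
      break;
    case "While":
      check(node.body, true); // the loop body is in a loop context
      break;
    case "Block":
      node.statements.forEach(s => check(s, inLoop));
      break;
    default:
      break; // other nodes: nothing to check in this sketch
  }
}

// A while loop containing a break passes the check...
check({ kind: "While", body: { kind: "Block", statements: [{ kind: "Break" }] } });
// ...but a bare break is a static (context-sensitive) error.
```

A real checker would carry a richer context object (enclosing function, symbol table, and so on) instead of a single boolean.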
It turns out that contextual constraints can be specified formally, just not in a context free grammar. Many techniques exist. Let’s look at a couple.
Here is an attribute grammar that produces a postfix string from an infix string, using semantic functions. We'll use this grammar for the source language (the infix):
E ⟶ E "+" T | E "-" T | T
T ⟶ T "*" F | T "/" F | F
F ⟶ "-" F | "(" E ")" | intlit | id
For the target language, we’ll use "~" for unary negation in the postfix formulation in order to avoid parentheses. As long as all operators have a fixed arity, parentheses are not necessary.
Grammar Rules | Semantic Rules
---|---
E → E1 "+" T | E.post := E1.post • T.post • "+"
E → E1 "-" T | E.post := E1.post • T.post • "-"
E → T | E.post := T.post
T → T1 "*" F | T.post := T1.post • F.post • "*"
T → T1 "/" F | T.post := T1.post • F.post • "/"
T → F | T.post := F.post
F → "-" F1 | F.post := F1.post • "~"
F → "(" E ")" | F.post := E.post
F → intlit | F.post := intlit.sourceString
F → id | F.post := id.sourceString
See how it works? Each symbol gets some properties (called attributes) as necessary, and we make rules that show how to assign attribute values. There’s a lot of theory here that we won’t cover, like whether attributes are synthesized or inherited, but you should work on gaining a basic understanding of what attribute grammars look like.
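In code, the semantic-function style corresponds to a bottom-up tree walk that computes the post attribute at each node. A minimal sketch (the AST node shapes are invented here for illustration):

```javascript
// Each rule "X.post := ..." becomes a case in a bottom-up tree walk.
// Node shapes ({op, left, right}, {op: "neg", operand}, {num}) are invented.
function post(node) {
  if (node.num !== undefined) return String(node.num);     // F → intlit
  if (node.op === "neg") return `${post(node.operand)} ~`; // F → "-" F
  // E → E1 op T and T → T1 op F: children first, then the operator
  return `${post(node.left)} ${post(node.right)} ${node.op}`;
}

// 3 + 4 * 5  ⇒  "3 4 5 * +"
const tree = {
  op: "+",
  left: { num: 3 },
  right: { op: "*", left: { num: 4 }, right: { num: 5 } },
};
console.log(post(tree)); // "3 4 5 * +"
```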
Let’s do some derivations.
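For instance, here is one worked example (sketched here, following the table’s rules) for the input 3 + 4 * 5, evaluating the attributes bottom-up:

```
3 + 4 * 5

F → intlit(3)    F.post = "3"
T → F            T.post = "3"
E → T            E.post = "3"                        (left operand of +)

F → intlit(4)    F.post = "4"
T → F            T.post = "4"                        (this is T1 in T → T1 "*" F)
F → intlit(5)    F.post = "5"
T → T1 "*" F     T.post = "4" • "5" • "*" = "4 5 *"

E → E1 "+" T     E.post = "3" • "4 5 *" • "+" = "3 4 5 * +"
```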
Now here is the specification using action routines. For these, it’s best to change up the grammar just a bit.
E ⟶ T (addop T)*
T ⟶ F (mulop F)*
F ⟶ "-" F | "(" E ")" | intlit | id
Now our attribute grammar looks like this:
E ⟶ T {E.post := T.post} (addop T {E.post •= T.post • addop.text})*
T ⟶ F {T.post := F.post} (mulop F {T.post •= F.post • mulop.text})*
F ⟶ "-" F1 {F.post := F1.post • "~"}
F ⟶ "(" E {F.post := E.post} ")"
F ⟶ id {F.post := image(id)}
F ⟶ intlit {F.post := image(intlit)}
addop ⟶ "+" {addop.text := "+"} | "-" {addop.text := "-"}
mulop ⟶ "*" {mulop.text := "*"} | "/" {mulop.text := "/"}
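Action routines drop straight into a recursive-descent parser: each embedded { … } becomes code executed at that point in the parse. A compressed sketch (the tokenizer and function names are invented here for illustration):

```javascript
// Action routines become code run mid-parse in a recursive-descent parser.
// The tokenizer and names are invented for illustration.
function toPostfix(source) {
  const tokens = source.match(/\d+|[a-z]+|[-+*\/()]/g);
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function parseE() {                  // E ⟶ T (addop T)*
    let post = parseT();               // {E.post := T.post}
    while (peek() === "+" || peek() === "-") {
      const op = next();               // addop
      post += ` ${parseT()} ${op}`;    // {E.post •= T.post • addop.text}
    }
    return post;
  }
  function parseT() {                  // T ⟶ F (mulop F)*
    let post = parseF();
    while (peek() === "*" || peek() === "/") {
      const op = next();
      post += ` ${parseF()} ${op}`;
    }
    return post;
  }
  function parseF() {                  // F ⟶ "-" F | "(" E ")" | intlit | id
    if (peek() === "-") { next(); return `${parseF()} ~`; }
    if (peek() === "(") { next(); const post = parseE(); next(); return post; }
    return next();                     // intlit or id
  }
  return parseE();
}

console.log(toPostfix("-x * (a + 3)")); // "x ~ a 3 + *"
```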
An attribute grammar for finding the derivative of polynomials, using semantic functions:

Grammar Rules | Semantic Rules
---|---
P → P1 + T | P.deriv := P1.deriv • "+" • T.deriv
P → P1 - T | P.deriv := P1.deriv • "-" • T.deriv
P → T | P.deriv := T.deriv
P → - T | P.deriv := "-" • T.deriv
T → C | T.deriv := "0"
T → C x | T.deriv := C.val
T → C x ^ E | T.deriv := product(E.val,C.val) • "x^" • pred(E.val)
T → x | T.deriv := "1"
T → x ^ E | T.deriv := E.val • "x^" • pred(E.val)
C → const | C.val := valueof(const)
E → const | E.val := valueof(const)
E → - const | E.val := "-" • valueof(const)
With action routines:
P ⟶ {P.deriv := ""} ("-" {P.deriv := "-"})? T {P.deriv •= T.deriv}
    ( ("+" {P.deriv •= "+"} | "-" {P.deriv •= "-"}) T {P.deriv •= T.deriv} )*
T ⟶ {c := 1; e := 1} [N {c := N.value}] "x" ("^" ("-" {e := -1})? N {e *= N.value})?
    {T.deriv := product(c,e) • "x^" • pred(e)}
  | N {T.deriv := "0"}
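Either version of the specification can be transcribed into code. Here is a minimal sketch of the three T-rules from the table for a term c·x^e; the `{c, e}` term representation is invented here for illustration:

```javascript
// The three T-rules for a term c·x^e, transcribed directly from the table.
// The {c, e} representation is invented for illustration.
function termDeriv({ c, e }) {
  if (e === 0) return "0";       // T → C        ⇒ "0"
  if (e === 1) return String(c); // T → C x      ⇒ C.val
  return `${c * e}x^${e - 1}`;   // T → C x ^ E  ⇒ product(E.val,C.val) • "x^" • pred(E.val)
}

console.log(termDeriv({ c: 3, e: 4 })); // "12x^3"
console.log(termDeriv({ c: 5, e: 1 })); // "5"
console.log(termDeriv({ c: 7, e: 0 })); // "0"
```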
Note that Ohm feels a lot like writing attribute grammars with semantic functions. However, Ohm allows arbitrary JavaScript code in its semantic functions, which is more flexible than just slapping attributes onto parse tree nodes.
The theory behind attribute grammars is pretty rich. In the compiler literature, much has been written about the order of attribute evaluation, and whether attributes bubble up the parse tree or can be passed down or sideways through the tree. It’s all fascinating stuff, and worthwhile when using certain compiler generator tools. But you can always just use Ohm and enforce contextual rules with code.
Another approach is to just treat contextual rules as part of the semantics of a language, albeit not the same semantics that defines the runtime effects of a program. It’s static semantics, and you can use the techniques of denotational or operational semantics to enforce the contextual rules, too.
For example, here’s a way to define the contextual constraints of Astro. Here $\vdash p$ means program $p$ is statically correct; $c \vdash e$ means expression $e$ is correct in context $c$, and $c \vdash s \Longrightarrow c'$ means that statement $s$ is correct in context $c$ and subsequent statements must be checked in context $c'$. In other words, statically analyzing a statement “updates” the context.
Previously, we gave formal definitions of Astro and Bella in which static and dynamic semantics were defined together. If we do decide to make a static semantics on its own, then the dynamic semantics can become simpler, since we can assume all the static checks have already been done. This is, in fact, how real compilers and interpreters work.
Logically speaking we do static analysis by traversing the CST or AST, decorating it, and checking things. We do quite a few tasks here, such as name and type resolution, control flow analysis, and data flow analysis. These tasks are often interleaved.
First we figure out which names refer to which (declared) entities, and what the types are for each expression. The first part is sometimes called scope analysis and involves symbol tables; the second does (some degree of) type inference.
A symbol table is a collection of mappings from names (identifiers) to entities. That is all.
There is a design thing to be aware of, though. Should you (1) create a new symbol table for every scope, or (2) manage just one symbol table? What are the tradeoffs?
 | Table per Scope | Single Table
---|---|---
Basic Idea | There is a symbol table for every scope. Maps identifier to (usually) a single entity. | There is only one table. Maps identifier to a list of entities.
On scope entry | Create a new, empty symbol table. Set this table’s parent to the current table. | Nothing, really. Or you could increment a counter so you always have a nesting level associated with each entity.
On a declaration | Add identifier/entity pair to table. | Push entity onto the list for the identifier.
On a use | Search the current table; if not found, recursively search the parent table (and so on). | Search the list.
On scope exit | Nothing, really. | Pop all entities added in this scope.
Advantages | Easy to implement. Tables can be immutable. | Lookup is efficient.
Disadvantages | Search can be slow, especially for globals! | Mutable. Scope exit is expensive.
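The table-per-scope design takes only a handful of lines. A sketch (the API here is invented for illustration):

```javascript
// Table-per-scope: each scope holds its own map plus a link to its parent.
// Lookup walks outward through enclosing scopes; scope exit is simply
// dropping the child table.
class SymbolTable {
  constructor(parent = null) {
    this.parent = parent;
    this.entries = new Map();
  }
  declare(name, entity) {
    if (this.entries.has(name)) throw new Error(`${name} already declared`);
    this.entries.set(name, entity);
  }
  lookup(name) {
    if (this.entries.has(name)) return this.entries.get(name);
    if (this.parent) return this.parent.lookup(name); // search enclosing scopes
    throw new Error(`${name} not declared`);
  }
}

const globals = new SymbolTable();
globals.declare("x", { type: "int" });
const inner = new SymbolTable(globals); // entering a nested scope
inner.declare("y", { type: "string" });
console.log(inner.lookup("x").type);    // "int" — found in the parent scope
```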
So how do we determine (infer) the type of every expression? This can get interesting. It depends on the language design. Statically typed languages generally fall into a spectrum of how much type inference they allow:
How much? | Characteristics | Exemplars
---|---|---
Low | Very verbose. Pretty much every declaration (of every variable, every field, every parameter, every function) mentions its type. | Java, prior to version 10
Medium | Generally variable declarations don’t need types (you can say `var x = 100` or `var message = "Hello"`), but parameter types and function return types must be explicit. | Go, Swift, Rust, C#
High | Virtually every identifier declaration can be given without a type. The inference engine has to do a ton of work. But the code looks so pretty and light, and we still get type safety guarantees before execution! Often uses a Hindley-Milner type system (with type variables) and perhaps type classes. | Standard ML, OCaml, F#, Haskell, Elm
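At the simpler end of the spectrum, inference is just another bottom-up walk computing a type attribute, much like the post attribute earlier. A tiny sketch (node shapes and the same-type rule for + are invented here):

```javascript
// Bottom-up type inference: the type of a node is computed from its
// children. Node shapes and rules are invented for illustration; here the
// rule for a binary operator requires both operands to have the same type.
function typeOf(node, context) {
  if (node.num !== undefined) return "int";
  if (node.str !== undefined) return "string";
  if (node.id !== undefined) {
    if (!context.has(node.id)) throw new Error(`${node.id} not declared`);
    return context.get(node.id);
  }
  const t1 = typeOf(node.left, context);
  const t2 = typeOf(node.right, context);
  if (t1 !== t2) throw new Error(`operand type mismatch: ${t1} vs ${t2}`);
  return t1;
}

const context = new Map([["x", "int"]]);
console.log(typeOf({ op: "+", left: { id: "x" }, right: { num: 1 } }, context)); // "int"
```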
So how to do type checking? You follow the type-equivalence and type-compatibility rules. These vary from language to language and are generally very straightforward, until you get to parameterized types, or generics. Why are these not straightforward? Well, suppose `Dog` is a subclass of `Animal`:

Dog <: Animal

Now what is the relationship between `List<Dog>` and `List<Animal>`? There are four cases:
- `List<Dog> <: List<Animal>` — covariance
- `List<Animal> <: List<Dog>` — contravariance
- neither relationship holds — invariance
- both relationships hold — bivariance
Then it gets more complicated. Do you remember, offhand, the relationships between `List<Dog>`, `List<Animal>`, and:

List<? extends Dog>
List<? extends Animal>
List<? super Dog>
List<? super Animal>
List<?>
List<Object>
Control Flow Analysis (CFA) is what we do when we build and query the control flow graph (CFG). This can help us find functions that are never called, code that is unreachable, some infinite loops, paths without return statements, etc.
Read about CFA and CFGs at Wikipedia.
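One of the simplest control-flow facts doesn’t even need a full graph: in a straight-line block, nothing after a return executes. A sketch (the statement shapes are invented here for illustration):

```javascript
// Flag statements in a block that follow a return — the simplest kind of
// unreachable-code finding. Statement shapes are invented for illustration;
// a real analysis would build and query the full control flow graph.
function unreachableIndexes(statements) {
  const returnAt = statements.findIndex(s => s.kind === "Return");
  if (returnAt === -1) return [];
  return statements.slice(returnAt + 1).map((_, i) => returnAt + 1 + i);
}

const block = [{ kind: "Print" }, { kind: "Return" }, { kind: "Print" }];
console.log(unreachableIndexes(block)); // [2]
```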
In data flow analysis (DFA), we determine where identifiers are declared, when they are initialized, when they are updated, and who reads (refers to) them. This tells us when identifiers are used but not declared, used but not initialized, declared but never used, and so on. We can also note, for each identifier at each point in the program, which other names could refer to the same entity.
Read about DFA at Wikipedia.
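The use-before-initialization check is easy to see on straight-line code. A sketch (the event shapes are invented here for illustration):

```javascript
// Track, per identifier, whether it has been assigned yet as we scan
// straight-line code, flagging uses that come first. Event shapes are
// invented for illustration; real DFA works over the control flow graph.
function findUninitializedUses(events) {
  const initialized = new Set();
  const problems = [];
  for (const { kind, name } of events) {
    if (kind === "assign") initialized.add(name);
    if (kind === "use" && !initialized.has(name)) problems.push(name);
  }
  return problems;
}

const program = [
  { kind: "declare", name: "x" },
  { kind: "use", name: "x" },    // used before any assignment!
  { kind: "assign", name: "x" },
  { kind: "use", name: "x" },    // fine now
];
console.log(findUninitializedUses(program)); // ["x"]
```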
Languages like Rust also do ownership and borrowing computations as part of data flow analysis. We might cover this in class. If we don’t, read these three sections from The Rust Programming Language book: Ownership, References and Borrowing, and Lifetimes.
A compiler’s static analyzer only needs to check whether programs violate language rules. But many statically “correct” programs are written weirdly, extremely error-prone, under-performant, resource-leaking, subject to race conditions, or produce completely unexpected (in other words, wrong) results when run. These problems can be caught by a linter.
Look at these cool lists:
You should integrate a linter into your text editor or IDE. Seeing both language errors (from the compiler) and linter errors while you write your program is a Good Thing.
We’ve covered: