It’s really hard to write a grammar that can enforce all of the legality rules for a language. So in practice, we don’t put too much effort into the grammar. We let the grammar take care only of the context-free rules, which can reject, for example:
clask C {int y/-&&^while var xyz += ((((((((x;}
but leave rules like identifier uniqueness, type checking, parameter matching—and anything else that requires examining context—out of the grammar. So we say something like this:
class C {int y = x;}
is correct according to the grammar—some might even say it is syntactically correct. But the program still should not be allowed to run: the variable x is used without ever being declared, an error that can be detected just by looking at the source code. Because the error is detectable before the program is executed, this is a kind of static error, known as a static semantic error or contextual error, the latter term emphasizing that the rule violation is not detected by the grammar.
In order to enforce the contextual constraints, the parse tree or abstract syntax tree is transformed into a graph structure known as a program representation. Although the representation is a graph, many people speak of it as a decorated abstract syntax tree, where the “decorations” are contextual information such as types, access modifiers, and other semantic details.
Contextual rules are language-dependent. Static analysis applies to some languages more than others. “Dynamic languages” do very little (perhaps no) analysis at compile time. Some languages are super static, permitting almost every check to be done before execution. These notes try to cover a wide range of possible semantic checks, whether or not most languages implement them.
The set of static constraints varies so much from language to language that, rather than focusing on a particular language, we’ll just lay out a few that you might see from time to time. This set of constraints is certainly incomplete, but it is a start.
break and continue statements may only appear in a loop, and return statements may only appear in a function. (It’s possible to encode these restrictions in the grammar, but it would be ugly.)
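These placement rules are easy to enforce with a recursive walk over the syntax tree that tracks whether we are currently inside a loop or a function. Here is a minimal sketch; the node classes are hypothetical stand-ins for a real AST, not from any particular compiler:

```python
from dataclasses import dataclass

class StaticError(Exception):
    pass

@dataclass
class Break:
    pass

@dataclass
class Return:
    pass

@dataclass
class While:
    body: list

@dataclass
class FunDecl:
    body: list

def check(node, in_loop=False, in_function=False):
    """Raise a StaticError for a break outside a loop or a return outside a function."""
    if isinstance(node, Break) and not in_loop:
        raise StaticError("break outside of loop")
    if isinstance(node, Return) and not in_function:
        raise StaticError("return outside of function")
    if isinstance(node, While):
        for statement in node.body:
            check(statement, True, in_function)
    if isinstance(node, FunDecl):
        # A function body starts fresh: a break in a function that is
        # itself nested inside a loop is still illegal.
        for statement in node.body:
            check(statement, False, True)
```

Note how entering a function body resets the loop flag but entering a loop body preserves the function flag.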
In $p.x$, $p$ must have a dictionary type and $x$ must be a field of the type of $p$; or $p$ is a module, package, or namespace, and $x$ is an identifier marked as exportable from it.

In $p[x]$, $p$ must have an array type and $x$ must have a type compatible with the index type of $p$’s type.

We saw earlier that most languages define their contextual constraints, informally, in prose.
However, contextual constraints can be specified formally, just not in a context free grammar nor even in most analytic grammar formalisms. Many techniques exist. Let’s look at two: attribute grammars and static semantic inference rules.
Here is an attribute grammar that produces a postfix string from an infix string, using semantic functions. We'll use this grammar for the source language (the infix):
E ⟶ E addop T | T
T ⟶ T mulop F | F
F ⟶ "-" F | "(" E ")" | intlit | id
addop ⟶ "+" | "-"
mulop ⟶ "*" | "/"
For the target language, we’ll use "~" for unary negation in the postfix formulation in order to avoid parentheses. As long as all operators have a fixed arity, parentheses are not necessary.
Semantic functions are attached to each rule, after simplifying the grammar by removing the convenience symbols such as |, ?, *, and +:
| Grammar Rules | Semantic Rules |
|---|---|
| E ⟶ E1 addop T | E.post := E1.post • T.post • addop.sourceString |
| E ⟶ T | E.post := T.post |
| T ⟶ T1 mulop F | T.post := T1.post • F.post • mulop.sourceString |
| T ⟶ F | T.post := F.post |
| F ⟶ "-" F1 | F.post := F1.post • "~" |
| F ⟶ "(" E ")" | F.post := E.post |
| F ⟶ intlit | F.post := intlit.sourceString |
| F ⟶ id | F.post := id.sourceString |
See how it works? Each symbol gets some properties (called attributes) as necessary, and we make rules that show how to assign attribute values. In this example our attributes were sourceString (for the source code text of the construct) and post (for the postfix representation of the construct). There’s a lot of theory here that we won’t cover, like whether attributes are synthesized or inherited, but this example should work for introducing what attribute grammars look like.
Let’s apply the rules to an example we’ll come up with in class.
Now here is the specification using action routines. We don’t have to put the grammar into a simplified, standard form for these, but we should get rid of the left recursion. Let’s use the following grammar:
E ⟶ T (addop T)*
T ⟶ F (mulop F)*
F ⟶ "-" F | "(" E ")" | intlit | id
addop ⟶ "+" | "-"
mulop ⟶ "*" | "/"
Now our attribute grammar looks like this:
E ⟶ T {E.post := T.post} (addop T {E.post •= T.post • addop.sourceString})*
T ⟶ F {T.post := F.post} (mulop F {T.post •= F.post • mulop.sourceString})*
F ⟶ "-" F1 {F.post := F1.post • "~"}
F ⟶ "(" E {F.post := E.post} ")"
F ⟶ intlit {F.post := intlit.sourceString}
F ⟶ id {F.post := id.sourceString}
Let’s do some attribute computations.
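A grammar with action routines translates almost directly into a recursive descent parser: each parsing function carries out the embedded actions as it goes. Here is a sketch in Python; the tokenizer and function names are our own:

```python
import re

def tokenize(source):
    """Split an infix expression into integer literals, identifiers, and operators."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", source)

def to_postfix(source):
    """Translate infix to postfix, following E ⟶ T (addop T)*, etc."""
    tokens = tokenize(source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_e():                      # E ⟶ T (addop T)*
        post = parse_t()
        while peek() in ("+", "-"):
            op = take()
            post = post + " " + parse_t() + " " + op
        return post

    def parse_t():                      # T ⟶ F (mulop F)*
        post = parse_f()
        while peek() in ("*", "/"):
            op = take()
            post = post + " " + parse_f() + " " + op
        return post

    def parse_f():                      # F ⟶ "-" F | "(" E ")" | intlit | id
        if peek() == "-":
            take()
            return parse_f() + " ~"     # unary negation becomes "~"
        if peek() == "(":
            take()
            post = parse_e()
            take()                      # consume ")"
            return post
        return take()                   # intlit or id

    return parse_e()
```

The `post = post + " " + ... + op` lines are exactly the `•=` actions in the grammar above, and the while loops mirror the Kleene stars.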
Note how semantic functions are written separately from the grammar rules, while action routines are embedded directly within the grammar rules.
An attribute grammar for finding the derivative of polynomials, using semantic functions:
| Grammar Rules | Semantic Rules |
|---|---|
| P ⟶ P1 + T | P.deriv := P1.deriv • "+" • T.deriv |
| P ⟶ P1 - T | P.deriv := P1.deriv • "-" • T.deriv |
| P ⟶ T | P.deriv := T.deriv |
| P ⟶ - T | P.deriv := "-" • T.deriv |
| T ⟶ C | T.deriv := "0" |
| T ⟶ C x | T.deriv := C.val |
| T ⟶ C x ^ E | T.deriv := product(E.val,C.val) • "x^" • pred(E.val) |
| T ⟶ x | T.deriv := "1" |
| T ⟶ x ^ E | T.deriv := E.val • "x^" • pred(E.val) |
| C ⟶ const | C.val := valueof(const) |
| E ⟶ const | E.val := valueof(const) |
| E ⟶ - const | E.val := "-" • valueof(const) |
With action routines:
P ⟶ {P.deriv := ""} ("-" {P.deriv := "-"})?
T {P.deriv •= T.deriv}
(
("+" {P.deriv •= "+"} |"-" {P.deriv •= "-"})
T {P.deriv •= T.deriv}
)*
T ⟶ {c := 1; e := 1} (N {c := N.value})?
"x" ("^" ("-" {e := -1})? N {e *= N.value})?
{T.deriv := product(c,e) • "x^" • pred(e)}
| N {T.deriv := "0"}
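The heart of these rules is the single-term case. Here is a hypothetical helper for one term c·x^e, with product and pred from the semantic rules spelled out as * and - 1 (this handles just one term, not a whole polynomial):

```python
def term_deriv(c, e):
    """Derivative of the single term c*x^e, following the T rules above."""
    if e == 0:                        # T ⟶ C: a constant differentiates to 0
        return "0"
    if e == 1:                        # T ⟶ C x: cx differentiates to c
        return str(c)
    return f"{c * e}x^{e - 1}"        # T ⟶ C x ^ E: product(e,c) x^pred(e)
```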
Note that Ohm feels a lot like writing attribute grammars with semantic functions (with the semantic operations completely outside of the grammar, which is amazing!). However, Ohm allows arbitrary JavaScript code in its semantic functions, which is more flexible than just attaching attributes to parse tree nodes.
The theory behind attribute grammars is pretty rich. In the compiler literature, much has been written about the order of attribute evaluation, and whether attributes bubble up the parse tree or can be passed down or sideways through the tree. It’s all fascinating stuff, and worthwhile when using certain compiler generator tools. But you can always just use Ohm and enforce contextual rules with code.
Another approach is to just treat contextual rules as part of the semantics of a language, albeit not the same semantics that defines the runtime effects of a program. It’s static semantics, and you can use the techniques of denotational or operational semantics to enforce the contextual rules, too.
For example, here’s a way to define the contextual constraints of the simple language Astro. Here $\vdash p$ means program $p$ is statically correct; $c \vdash e$ means expression $e$ is correct in context $c$, and $c \vdash s \Longrightarrow c'$ means that statement $s$ is correct in context $c$ and subsequent statements must be checked in context $c'$. Thinking imperatively, statically analyzing a statement “updates” the context.
The context used above was a function mapping identifiers to either (1) $\mathsf{undef}$, (2) a pair consisting of $\mathsf{num}$ and either $\mathsf{RO}$ or $\mathsf{RW}$, or (3) a pair consisting of $\mathsf{fun}$ and an arity.
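As a sketch of how such a context might be represented in code (the dictionary encoding, helper names, and the particular standard-library entries are our own simplifications, not Astro’s official definition), each checking function either raises an error or returns the context for subsequent statements:

```python
class StaticError(Exception):
    pass

# The context: identifiers map to ("num", "RO" or "RW") for variables or
# ("fun", arity) for functions; identifiers absent from the map play the
# role of undef. These two starter entries are illustrative.
initial_context = {"π": ("num", "RO"), "sqrt": ("fun", 1)}

def check_assignment(context, name):
    """id = e either declares a new read-write variable or writes an
    existing one, returning the context for subsequent statements."""
    entity = context.get(name)
    if entity is None:
        return {**context, name: ("num", "RW")}   # a fresh declaration
    if entity != ("num", "RW"):
        raise StaticError(f"cannot assign to {name}")
    return context

def check_call(context, name, argument_count):
    """name(e1, ..., en) requires a function of matching arity."""
    entity = context.get(name)
    if entity is None or entity[0] != "fun":
        raise StaticError(f"{name} is not a function")
    if entity[1] != argument_count:
        raise StaticError(f"{name} expects {entity[1]} argument(s)")
```

Returning a new dictionary from check_assignment is the code-level analog of the judgment $c \vdash s \Longrightarrow c'$: the statement “updates” the context.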
Previously, we gave formal definitions of Astro and Bella in which static and dynamic semantics were defined together. If we do decide to make a static semantics on its own, then the dynamic semantics can become simpler, since we can assume all the static checks have already been done. This is, in fact, how real compilers and interpreters work.
Astro and Bella are fairly simple languages. Specifically, every legal expression has the type number, so we required only the simple notation $c \vdash e$ (meaning $e$ is legal in context $c$). In languages in which expressions may have multiple types, we often use the notation $c \vdash e : \tau$ (meaning $e$ has type $\tau$ in context $c$). There exist languages in which expressions “update” the context, so we require notations such as $c \vdash e : \tau \Longrightarrow c'$.
Logically speaking, we do static analysis by traversing the CST or AST, decorating it, and checking things. We perform quite a few tasks here, such as name analysis, type analysis, control flow analysis, and data flow analysis. These tasks are often interleaved.
What do we mean by decorating? Suppose we had this source code:
    function gcd(x: int, y: int) {
      return y == 0 ? x : gcd(y, x % y);
    }
    print(gcd(5023427, 920311));
At syntax time, you only know that identifiers (of types, of variables, of functions, of whatever) are strings! You don't know what entities they refer to. You don’t really know the types of anything: you just have strings that identify the types for some parameters, but the types of expressions have to be inferred by the analyzer. All you get from the grammar is this:
    Program
      statements: [
        FunDecl (name: "gcd", returnType: null)
          params: [
            Param (name: "x", type: "int")
            Param (name: "y", type: "int")]
          statements: [
            ReturnStmt
              expression: Conditional
                test: Binary (op: "==", left: "y", right: 0),
                consequent: "x",
                alternate: FunCall
                  callee: "gcd",
                  args: [
                    "y",
                    Binary (op: "%", left: "x", right: "y")]]]
        Print
          args: [Call (callee: "gcd", args: [5023427, 920311])]]
Again, make sure you understand: There are no connections between the use of any identifier and the entity to which it should resolve. There are no types on any of the expressions, so they all have to be inferred. We do have type annotations on the parameters, though in some languages these are optional. There was no explicit type annotation for the return type of the function, so this would have to be inferred.
Static analysis is the process of transforming the syntax tree above into a program representation. For this example, the representation might look like this:

Let’s look at a high-level how-to.
We have to figure out which names refer to which (declared) entities. The idea is somewhat simple: when an identifier is declared (or defined), you record it somewhere. The conventional term for the place you store the identifiers, and the entities they refer to, is symbol table.
A symbol table is a collection of mappings from names (identifiers) to entities. That is all.
How can we employ symbol tables? Should you (1) create a new symbol table for every scope, or (2) manage just one symbol table? What are the tradeoffs?
| | Table per Scope | Single Table |
|---|---|---|
| Basic Idea | There is a symbol table for every scope. Each identifier is mapped to single entity (unless overloading is permitted). | There is only one table. Each identifier is mapped to a list of entities, with each entity tagged with its scope. |
| On scope entry | Create a new, empty symbol table. Set this table’s parent to the current table. | Nothing really. Or you could increment a counter so you always have a nesting level associated with each entity. |
| On a declaration | Add identifier/entity pair to table. | Push entity onto list for the identifier. |
| On a use | Search current table; if not found, recursively search the parent table (and so on). | Search the list. |
| On scope exit | Nothing, really. | Pop all entities added in this scope. |
| Advantages | Easy to implement. Tables can be immutable. | Lookup is efficient. |
| Disadvantages | Search can be slow, especially for globals! | Mutable. Scope exit expensive. |
Every time an identifier is encountered, the symbol table(s) are consulted to determine (1) whether the entity was declared in scope, and if so, (2) what entity the identifier refers to.
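The table-per-scope approach can be sketched in just a few lines; the class and method names here are our own:

```python
class SymbolTable:
    """One table per scope, each with a link to its enclosing scope's table."""

    def __init__(self, parent=None):
        self.entries = {}
        self.parent = parent

    def add(self, name, entity):
        # Redeclaration in the *same* scope is an error; shadowing an
        # outer declaration is fine because it lives in a different table.
        if name in self.entries:
            raise Exception(f"{name} already declared in this scope")
        self.entries[name] = entity

    def lookup(self, name):
        # Search this table, then recursively search enclosing scopes.
        if name in self.entries:
            return self.entries[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise Exception(f"{name} not declared")
```

On scope entry, create `SymbolTable(parent=current_table)`; on scope exit, simply resume using the parent. Old tables are never modified, which is why this approach permits immutable tables.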
We also need to determine the type of every expression.
Type checking is a major component of static analysis! There are two major parts: (1) inferring the type of each expression, and (2) checking that each expression’s type is compatible with the context in which the expression appears.
The rules vary widely from language to language. Some languages have a massively rich system of static types (e.g., 10 or more distinct types for numbers alone). Some languages have incredibly intricate and detailed rules for which types are compatible with others.
Type inference is usually not so bad for literals.
Usually? Heh, let’s come up with examples in class where it isn’t always so easy.
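For the easy cases, inference really is just a lookup on the lexeme itself. Here is a sketch for a hypothetical language with only four literal forms (the forms and type names are made up for the example):

```python
def literal_type(lexeme):
    """Infer a type from a literal's lexeme alone, for a hypothetical
    language with boolean, string, float, and int literals."""
    if lexeme in ("true", "false"):
        return "boolean"
    if lexeme.startswith('"'):        # assume strings are double-quoted
        return "string"
    return "float" if "." in lexeme else "int"
```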
For identifiers, many languages attach type constraints in different ways, including not at all. Sometimes the type constraint on the variable has to be inferred from context. You can imagine different degrees of type inference requirements on identifiers:
| How much? | Characteristics | Exemplars |
|---|---|---|
| Low | Very verbose. Pretty much every declaration (of every variable, every field, every parameter, every function) mentions its type. | Java, prior to version 10 |
| Medium | Generally variable declarations don’t need types (you can say var x = 100 or var message = "Hello"), but parameter types and function return types must be explicit. | Go, Swift, Rust, C# |
| High | Virtually every identifier declaration can be given without a type. The inference engine has to do a ton of work. But the code looks so pretty and light, and we still get type safety guarantees before execution! Often uses a Hindley-Milner type system (with type variables) and perhaps, type classes. | Standard ML, OCaml, F#, Haskell, Elm |
The rules for type compatibility vary widely from language to language. Some common considerations include:
Type checking also involves understanding overloading rules, the polymorphism mechanisms for the language, type inference rules, and how and when the language uses covariance, contravariance, invariance, and bivariance. What do those latter terms mean? Suppose Dog is a subclass of Animal:
Dog <: Animal
Now what is the relationship between List<Dog> and List<Animal>? There are four cases:
List<Dog> <: List<Animal> — covariance
List<Animal> <: List<Dog> — contravariance
neither relationship holds — invariance
both relationships hold — bivariance
Then it gets more complicated. Do you remember, offhand, the relationships between List<Dog>, List<Animal>, and:
List<? extends Dog>
List<? extends Animal>
List<? super Dog>
List<? super Animal>
List<?>
List<Object>
Control Flow Analysis (CFA) is what we do when we build and query the control flow graph (CFG). This can help us find functions that are never called, code that is unreachable, some infinite loops, paths without return statements, etc.
Read about CFA and CFGs at Wikipedia.
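As a tiny illustration of the kind of query a CFG supports (the graph encoding here is made up for the example), unreachable basic blocks are exactly those that a search from the entry block never visits:

```python
def unreachable_blocks(cfg, entry):
    """cfg maps each basic block name to the list of its successor blocks.
    Returns the blocks not reachable from entry, i.e., dead code."""
    seen = set()
    stack = [entry]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg[block])
    return set(cfg) - seen
```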
In Data Flow Analysis (DFA), we determine where identifiers are declared, when they are initialized, when they are updated, and who reads (refers to) them. This tells us when identifiers are used but not declared, used but not initialized, declared but never used, and so on. We can also note, for each identifier at each point in the program, which other names could refer to the same entity.
Read about DFA at Wikipedia.
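For a first taste, here is a sketch of a use-before-initialization check for straight-line code (the statement encoding is invented for the example; real data flow analysis must merge information across the paths of a CFG):

```python
def uses_before_initialization(statements):
    """Each statement is (target, [names read]); e.g., x = y + z is
    ("x", ["y", "z"]). Returns names read before any assignment to them.
    Straight-line code only: no branches or loops."""
    initialized, problems = set(), []
    for target, reads in statements:
        for name in reads:
            if name not in initialized:
                problems.append(name)
        initialized.add(target)
    return problems
```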
Languages like Rust also do ownership and borrowing computations as part of data flow analysis. We might cover this in class. If we don’t, read these three sections from The Rust Programming Language Book: Ownership, References and Borrowing, and Lifetimes.
A compiler’s static analyzer only needs to check whether programs violate language rules. But many statically “correct” programs are written weirdly, extremely error prone, under-performant, resource-leaking, subject to race conditions, vulnerable to attacks, or produce completely unexpected (in other words, wrong) results when run. These problems can be caught by a linter.
Look at these cool lists:
You should integrate a linter into your text editor or IDE. Seeing both language errors (from the compiler) and linter errors while you write your program is a Good Thing.
Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.
Here is an incomplete list:
We’ve covered: