Static Analysis

So, the parser says your program checks out according to the grammar. Doesn’t mean it should be allowed to run, does it?

Motivation

It’s really hard (often provably impossible) to write a context-free grammar that captures all of the legality rules of a language. So in practice, we don’t try to make the grammar do everything. We let the grammar take care of only the context-free rules, which are enough to reject, for example:

clask C {int y/-&&^while var xyz += ((((((((x;}

but leave rules like identifier uniqueness, type checking, parameter matching—and anything else that requires examining context—out of the grammar. So we say something like this:

class C {int y = x;}

is correct according to the grammar—some might even say it is syntactically correct. But the program still should not be allowed to run, because there is an error (the variable x is never declared) that can be detected just by looking at the source code. Because the error is detectable before the program is executed, it is a static error, and finding these errors is part of the activity known as static analysis. Whether you call these kinds of errors “static semantic errors” or “context-sensitive syntax errors” is really up to you.

Exercise: Many language rules are not context free, and this can be proven. For example, you can indeed prove that enforcing the rule that the number of arguments in a call must match the number of parameters of the callee is not context-free. Prove this, or find, read, and understand an existing proof.
Exercise: Come up with some English sentences that are syntactically correct but semantically nonsense. Here are two to get you started: “Colorless green ideas sleep furiously” (Chomsky’s famous example), and “One gaseous solid rocks smelling the color seventeen will think yesterday.”

In order to enforce the contextual constraints, it is necessary to decorate the parse tree or AST with contextual information.

Static analysis applies to some languages more than others. “Dynamic languages” do almost no analysis at compile time. Some languages are super static, permitting almost every check to be done before execution. These notes try to consider tons of possible semantic checks, whether or not most languages implement them.

Examples of Contextual Constraints

The set of static constraints varies quite a bit from language to language, so rather than focusing on a particular language, here are a few constraints that you might see somewhere:

  • Identifiers must be declared before they are used
  • An identifier may not be declared more than once in the same scope
  • The number of arguments in a call must match the number of parameters of the callee
  • The operands of each operator, and the arguments of each call, must have acceptable types
  • A break statement may appear only inside a loop

Formal Specification of Context Rules

It turns out that contextual constraints can be specified formally, just not in a context-free grammar. Many techniques exist. Let’s look at a couple.

Attribute Grammars

In an attribute grammar, we embed either action routines or semantic functions into a traditional (context-free) grammar. The idea is that each grammar symbol can have any number of attributes attached to it (and the attributes can be whatever you like). As the grammar rules are matched during a parse, the semantic functions or action routines fire, updating the attributes. Let’s learn via example.

Example: Infix to Postfix Conversion

Here is an attribute grammar that produces a postfix string from an infix string, using semantic functions. We'll use this grammar for the source language (the infix):

E ⟶ E "+" T | E "-" T | T
T ⟶ T "*" F | T "/" F | F
F ⟶ "-" F | "(" E ")" | intlit | id

For the target language, we’ll use "~" for unary negation in the postfix formulation in order to avoid parentheses. As long as all operators have a fixed arity, parentheses are not necessary.

Grammar Rules        Semantic Rules
E → E1 "+" T         E.post := E1.post • T.post • "+"
E → E1 "-" T         E.post := E1.post • T.post • "-"
E → T                E.post := T.post
T → T1 "*" F         T.post := T1.post • F.post • "*"
T → T1 "/" F         T.post := T1.post • F.post • "/"
T → F                T.post := F.post
F → "-" F1           F.post := F1.post • "~"
F → "(" E ")"        F.post := E.post
F → intlit           F.post := intlit.sourceString
F → id               F.post := id.sourceString

See how it works? Each symbol gets some properties (called attributes) as necessary, and we make rules that show how to assign attribute values. There’s a lot of theory here that we won’t cover, like whether attributes are synthesized or inherited, but you should work on gaining a basic understanding of what attribute grammars look like.

CLASSWORK
Let’s do some derivations.
Exercise: Actually implement this thing in Ohm.
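
If you want a head start on that last exercise, here is one possible sketch using Ohm and JavaScript. The grammar layout, the operation name post, and the test input are my own choices, not the only way to do it:

  import * as ohm from "ohm-js";

  const grammar = ohm.grammar(`
    Infix {
      Exp    = Exp "+" Term     -- plus
             | Exp "-" Term     -- minus
             | Term
      Term   = Term "*" Factor  -- times
             | Term "/" Factor  -- divide
             | Factor
      Factor = "-" Factor       -- neg
             | "(" Exp ")"      -- paren
             | intlit
             | id
      intlit = digit+
      id     = letter alnum*
    }
  `);

  // Recent versions of Ohm supply pass-through actions for single-child nodes
  // (like Exp = Term), so we only need actions for the labeled cases and the
  // lexical rules.
  const postfix = grammar.createSemantics().addOperation("post", {
    Exp_plus(e, _op, t)    { return `${e.post()} ${t.post()} +`; },
    Exp_minus(e, _op, t)   { return `${e.post()} ${t.post()} -`; },
    Term_times(t, _op, f)  { return `${t.post()} ${f.post()} *`; },
    Term_divide(t, _op, f) { return `${t.post()} ${f.post()} /`; },
    Factor_neg(_op, f)     { return `${f.post()} ~`; },
    Factor_paren(_open, e, _close) { return e.post(); },
    intlit(_digits)        { return this.sourceString; },
    id(_first, _rest)      { return this.sourceString; },
  });

  const match = grammar.match("3 * -(x + 2)");
  if (match.succeeded()) {
    console.log(postfix(match).post());   // prints: 3 x 2 + ~ *
  }

Note how each semantic action corresponds to one row of the table above.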

Now here is the specification using action routines. For these, it’s best to change the grammar just a bit.

E ⟶ T (addop T)*
T ⟶ F (mulop F)*
F ⟶ "-" F | "(" E ")" | intlit | id

Now our attribute grammar looks like this:

E     ⟶ T {E.post := T.post} (addop T {E.post •= T.post • addop.text})*
T     ⟶ F {T.post := F.post} (mulop F {T.post •= F.post • mulop.text})*
F     ⟶ - F1 {F.post := F1.post • "~"}
F     ⟶ "(" E {F.post := E.post} ")"
F     ⟶ id {F.post := image(id)}
F     ⟶ intlit {F.post := image(intlit)}
addop ⟶ "+" {addop.text := "+"} | "-" {addop.text := "-"}
mulop ⟶ "*" {mulop.text := "*"} | "/" {mulop.text := "/"}

Example: Polynomial Differentiation

An attribute grammar for finding the derivative of polynomials, using semantic functions:

Grammar Rules    Semantic Rules
P → P1 + T       P.deriv := P1.deriv • "+" • T.deriv
P → P1 - T       P.deriv := P1.deriv • "-" • T.deriv
P → T            P.deriv := T.deriv
P → - T          P.deriv := "-" • T.deriv
T → C            T.deriv := "0"
T → C x          T.deriv := C.val
T → C x ^ E      T.deriv := product(E.val,C.val) • "x^" • pred(E.val)
T → x            T.deriv := "1"
T → x ^ E        T.deriv := E.val • "x^" • pred(E.val)
C → const        C.val := valueof(const)
E → const        E.val := valueof(const)
E → - const      E.val := "-" • valueof(const)

With action routines:

P ⟶ {P.deriv := ""} ("-" {P.deriv := "-"})?
     T {P.deriv •= T.deriv}
     (
         ("+" {P.deriv •= "+"} |"-" {P.deriv •= "-"})
         T {P.deriv •= T.deriv}
     )*

T ⟶ {c := 1; e := 1} [N {c := N.value}]
     "x" ("^" ("-" {e := -1})? N {e *= -N.value})?
     {T.deriv := product(c,e) • "x^" • pred(e)}
  |  N {T.deriv := "0"}

Note that writing Ohm semantics feels a lot like writing attribute grammars with semantic functions. However, Ohm allows arbitrary JavaScript code in its semantic operations, which is more flexible than just attaching attributes to parse tree nodes.

The theory behind attribute grammars is pretty rich. In the compiler literature, much has been written about the order of attribute evaluation, and whether attributes bubble up the parse tree (synthesized) or can be passed down or sideways through the tree (inherited). It’s all fascinating stuff, and worthwhile when using certain compiler generator tools. But you can always just use Ohm and enforce contextual rules with code.

“Static Semantics”

Another approach is to treat contextual rules as part of the semantics of a language, albeit not the same semantics that defines the runtime effects of a program. This is the static semantics, and you can use the techniques of denotational or operational semantics to specify the contextual rules, too.

For example, here’s a way to define the contextual constraints of Astro. Here $\vdash p$ means program $p$ is statically correct; $c \vdash e$ means expression $e$ is correct in context $c$, and $c \vdash s \Longrightarrow c'$ means that statement $s$ is correct in context $c$ and subsequent statements must be checked in context $c'$. In other words, statically analyzing a statement “updates” the context.

$$\frac{}{c \vdash [\![n]\!]}$$
$$\frac{c(i) = (\mathsf{NUM}, \_)}{c \vdash [\![i]\!]}$$
$$\frac{c(i) = (\mathsf{FUN}, n) \;\;\;\; (c \vdash e_i)_{i=1}^n} {c \vdash [\![i\;e_1 \ldots e_n]\!]}$$
$$\frac{c \vdash e}{c \vdash [\![- e]\!]}$$
$$\frac{\begin{gathered} op \in \{ \mathsf{+}, \mathsf{-}, \mathsf{*}, \mathsf{/}, \mathsf{\%}, \mathtt{**}\} \\ c \vdash e_1 \;\;\; c \vdash e_2 \end{gathered}} {c \vdash [\![e_1\;op\;e_2]\!]}$$
$$\frac{c \vdash e} {c \vdash [\![\mathtt{print}\;e]\!] \Longrightarrow c}$$
$$\frac{c \vdash e \;\;\;\; c(i) = \mathsf{UNDEF} \vee c(i) = (\mathsf{NUM}, \mathsf{RW})} {c \vdash [\![i = e]\!] \Longrightarrow c[(\mathsf{NUM}, \mathsf{RW})\,/\,i]}$$
$$\frac{(c_{i-1} \vdash s_i \Longrightarrow c_i)_{i=1}^n} {\vdash [\![\mathtt{program}\;s_1 \ldots s_n]\!]}$$
$ \begin{array}{l} c_0 = \lambda\,i. \mathsf{UNDEF} \:[\\ \;\;\;\; (\mathsf{NUM},\,\mathsf{RO})\,/\,\mathtt{π}] \:[\\ \;\;\;\; (\mathsf{FUN},1) \,/\,\mathtt{sqrt}] \:[\\ \;\;\;\; (\mathsf{FUN},1) \,/\,\mathtt{sin}] \:[\\ \;\;\;\; (\mathsf{FUN},1) \,/\,\mathtt{cos}] \:[\\ \;\;\;\; (\mathsf{FUN},2) \,/\,\mathtt{hypot}] \end{array} $
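
To make the notation a bit more concrete, here is a rough JavaScript sketch of how a checker might enforce a few of these rules. The AST node shapes (Numeral, Call, and so on) and the entity representations are made up for illustration, not taken from any official Astro implementation:

  // Entities: { kind: "NUM", access: "RO" | "RW" } for numbers, or
  // { kind: "FUN", arity: n } for functions. The context is a Map.
  function checkExpression(e, context) {
    switch (e.type) {
      case "Numeral":                    // numerals always check out
        return;
      case "Identifier": {               // must name a declared number
        const entity = context.get(e.name);
        if (entity?.kind !== "NUM") throw new Error(`${e.name} is not a number`);
        return;
      }
      case "Call": {                     // callee must be a function of the right arity
        const entity = context.get(e.callee);
        if (entity?.kind !== "FUN") throw new Error(`${e.callee} is not a function`);
        if (entity.arity !== e.args.length) {
          throw new Error(`${e.callee} expects ${entity.arity} argument(s)`);
        }
        e.args.forEach(arg => checkExpression(arg, context));
        return;
      }
      case "UnaryExpression":
        return checkExpression(e.operand, context);
      case "BinaryExpression":
        if (!["+", "-", "*", "/", "%", "**"].includes(e.op)) {
          throw new Error(`Unknown operator ${e.op}`);
        }
        checkExpression(e.left, context);
        return checkExpression(e.right, context);
    }
  }

  // An assignment may create a new variable or update a writable number; it
  // yields the context in which the following statements are checked.
  function checkAssignment(s, context) {
    checkExpression(s.source, context);
    const entity = context.get(s.target);
    if (entity && !(entity.kind === "NUM" && entity.access === "RW")) {
      throw new Error(`Cannot assign to ${s.target}`);
    }
    return new Map(context).set(s.target, { kind: "NUM", access: "RW" });
  }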

Previously, we gave formal definitions of Astro and Bella in which static and dynamic semantics were defined together. If we instead define a static semantics on its own, the dynamic semantics becomes simpler, since it can assume all the static checks have already been done. This is, in fact, how real compilers and interpreters work.

Exercise: Define a new dynamic-only operational semantics of Astro, assuming the static semantics above.
Exercise: Split the semantics of Bella into static and dynamic parts.

Implementing a Static Analyzer

Logically speaking, we do static analysis by traversing the CST or AST, decorating it, and checking things. There are quite a few tasks here, such as name and type resolution, control flow analysis, and data flow analysis, and these tasks are often interleaved.

Name and Type Resolution

First, we figure out which names refer to which (declared) entities, and what the type of each expression is. The first part is sometimes called scope analysis and involves symbol tables; the second requires (some degree of) type inference.

Symbol Tables

A symbol table is a collection of mappings from names (identifiers) to entities. That is all.

There is a design thing to be aware of, though. Should you (1) create a new symbol table for every scope, or (2) manage just one symbol table? What are the tradeoffs?

Basic idea
  Table per scope: there is a symbol table for every scope, mapping each identifier to (usually) a single entity.
  Single table: there is only one table, mapping each identifier to a list of entities.

On scope entry
  Table per scope: create a new, empty symbol table whose parent is the current table.
  Single table: nothing really, though you could increment a counter so each entity is associated with a nesting level.

On a declaration
  Table per scope: add the identifier/entity pair to the current table.
  Single table: push the entity onto the list for that identifier.

On a use
  Table per scope: search the current table; if not found, recursively search the parent table (and so on).
  Single table: search the identifier’s list.

On scope exit
  Table per scope: nothing, really.
  Single table: pop all the entities added in this scope.

Advantages
  Table per scope: easy to implement; tables can be immutable.
  Single table: lookup is efficient.

Disadvantages
  Table per scope: search can be slow, especially for globals!
  Single table: mutable; scope exit is expensive.
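
Here is a minimal sketch, in JavaScript, of the “table per scope” design. The class and method names (Context, add, lookup) are my own; a real analyzer would store richer entities and report errors in its own way:

  // The "table per scope" design: each scope gets its own table, chained to
  // its parent scope's table.
  class Context {
    constructor(parent = null) {
      this.parent = parent;
      this.locals = new Map();            // identifier -> entity
    }
    newChildContext() {                   // on scope entry
      return new Context(this);
    }
    add(name, entity) {                   // on a declaration
      if (this.locals.has(name)) {
        throw new Error(`${name} is already declared in this scope`);
      }
      this.locals.set(name, entity);
    }
    lookup(name) {                        // on a use: search here, then the parents
      if (this.locals.has(name)) return this.locals.get(name);
      if (this.parent) return this.parent.lookup(name);
      throw new Error(`${name} has not been declared`);
    }
  }

  // Usage sketch
  const globalScope = new Context();
  globalScope.add("π", { kind: "NUM", access: "RO" });
  const blockScope = globalScope.newChildContext();
  blockScope.add("x", { kind: "NUM", access: "RW" });
  blockScope.lookup("π");                 // found by searching the parent scope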

Type Inference and Type Checking

So how do we determine (infer) the type of every expression? This can get interesting, and it depends on the language design. Statically typed languages generally fall somewhere on a spectrum of how much type inference they allow:

Low
  Characteristics: Very verbose. Pretty much every declaration (of every variable, every field, every parameter, every function) mentions its type.
  Exemplars: Java, prior to version 10

Medium
  Characteristics: Generally variable declarations don’t need types (you can say var x = 100 or var message = "Hello"), but parameter types and function return types must be explicit.
  Exemplars: Go, Swift, Rust, C#

High
  Characteristics: Virtually every identifier declaration can be given without a type. The inference engine has to do a ton of work, but the code looks so pretty and light, and we still get type safety guarantees before execution! Often uses a Hindley-Milner type system (with type variables) and perhaps type classes.
  Exemplars: Standard ML, OCaml, F#, Haskell, Elm
Exercise: Read about the Hindley-Milner type system.

So how do we do type checking? You follow the type-equivalence and type-compatibility rules. These vary from language to language and are generally very straightforward, until you get to parameterized types, or generics. Why are these not straightforward? Well, suppose Dog is a subclass of Animal:

  Dog <: Animal

Now what is the relationship between List<Dog> and List<Animal>? There are four cases:

  • Covariance: List<Dog> <: List<Animal>, following the relationship of the element types
  • Contravariance: List<Animal> <: List<Dog>, reversing the relationship of the element types
  • Invariance: neither list type is a subtype of the other
  • Bivariance: each list type is a subtype of the other

Then it gets more complicated. Do you remember, offhand, the relationships between List<Dog>, List<Animal>, and:

  List<? extends Dog>
  List<? extends Animal>
  List<? super Dog>
  List<? super Animal>
  List<?>
  List<Object>
Exercise: Draw the class inheritance graph for these classes.
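
Setting the generics question aside, here is a minimal sketch of what the basic bottom-up type inference and checking described above can look like in code. The node shapes (NumberLiteral, BinaryExpression, and so on) and the type names are invented for illustration:

  // Bottom-up type inference for a tiny expression language.
  function inferType(e, context) {
    switch (e.type) {
      case "NumberLiteral":
        return "number";
      case "StringLiteral":
        return "string";
      case "Identifier": {
        const type = context.get(e.name);
        if (!type) throw new Error(`${e.name} has not been declared`);
        return type;
      }
      case "BinaryExpression": {
        const left = inferType(e.left, context);
        const right = inferType(e.right, context);
        if (e.op === "+" && left === "string" && right === "string") return "string";
        if (left !== "number" || right !== "number") {
          throw new Error(`Operator ${e.op} expects two numbers`);
        }
        return "number";
      }
    }
  }

  // "Medium" inference in action: var x = 100 gives x the type of its initializer.
  const context = new Map();
  context.set("x", inferType({ type: "NumberLiteral", value: 100 }, context));
  console.log(context.get("x"));          // prints: number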

Control Flow Analysis

Control Flow Analysis (CFA) is what we do when we build and query the control flow graph (CFG). This can help us find functions that are never called, code that is unreachable, some infinite loops, paths without return statements, etc.

Read about CFA and CFGs at Wikipedia.
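
As a tiny illustration (with invented block names), here is a CFG represented as an adjacency map from each basic block to its successors, together with a reachability check that flags unreachable blocks:

  const cfg = new Map([
    ["entry", ["loopTest"]],
    ["loopTest", ["loopBody", "exit"]],
    ["loopBody", ["loopTest"]],
    ["afterReturn", ["exit"]],            // no path from entry reaches this block
    ["exit", []],
  ]);

  function reachableFrom(start, graph) {
    const seen = new Set();
    const stack = [start];
    while (stack.length > 0) {
      const block = stack.pop();
      if (seen.has(block)) continue;
      seen.add(block);
      stack.push(...(graph.get(block) ?? []));
    }
    return seen;
  }

  const reachable = reachableFrom("entry", cfg);
  for (const block of cfg.keys()) {
    if (!reachable.has(block)) console.log(`Unreachable code in block ${block}`);
  }
  // prints: Unreachable code in block afterReturn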

Data Flow Analysis

In DFA, we determine where identifiers are declared, when they are initialized, when they are updated, and who reads (refers to) them. This tells us when identifiers are used but not declared, used but not initialized, declared but never used, and so on. We can also note, for each identifier at each point in the program, which other entities could refer to it.

Read about DFA at Wikipedia.
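
Here is one small data flow question, sketched in JavaScript with made-up statement shapes (type, name, identifiersRead): which declared identifiers are never read?

  function findUnusedVariables(statements) {
    const declared = new Set();
    const used = new Set();
    for (const s of statements) {
      if (s.type === "VariableDeclaration") declared.add(s.name);
      for (const id of s.identifiersRead) used.add(id);
    }
    return [...declared].filter(name => !used.has(name));
  }

  // Usage sketch
  const program = [
    { type: "VariableDeclaration", name: "x", identifiersRead: [] },
    { type: "VariableDeclaration", name: "unusedThing", identifiersRead: ["x"] },
    { type: "PrintStatement", identifiersRead: ["x"] },
  ];
  console.log(findUnusedVariables(program));   // prints: [ 'unusedThing' ]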

Languages like Rust also do ownership and borrowing computations as part of data flow analysis. We might cover this in class. If we don’t, read these three sections from The Rust Programming Language book: Ownership, References and Borrowing, and Lifetimes.

Linting

A compiler’s static analyzer only needs to check whether programs violate language rules. But there are many statically “correct” programs that are written weirdly, are extremely error-prone, perform badly, leak resources, are subject to race conditions, or produce completely unexpected (in other words, wrong) results when run. These problems can be caught by a linter.

Look at these cool lists:

You should integrate a linter into your text editor or IDE. Seeing both language errors (from the compiler) and linter errors while you write your program is a Good Thing.

Exercise: Skim these lists. Which rules are your favorites? Which surprised you?
Exercise: Find some linters for Python, Ruby, PHP, and C#.

Summary

We’ve covered:

  • Examples of Contextual Constraints
  • Formal Specification of Context Rules
  • Attribute Grammars
  • Static semantic rules
  • Implementing a Static Analyzer
  • Linting