Recall Questions
Do you like spaced repetition learning? Have you used Anki or Quizlet? Whether or not spaced repetition works for you, periodically working on flash-card-like questions can be a lot of fun and, when done well, will certainly help you retain information.
For this course, recall questions tied to language-independent concepts are found at the bottom of the course notes pages. Please revisit them from time to time.
Short Answer Questions
Here are some questions that might be found on an in-class paper-and-pencil exam.
- In your own words, write one sentence for each of the four major theories of computation, conveying its central question and its areas of concern. Write as if your job depended on clarity, accuracy, and solid English writing skills. If you look to an AI assistant for help, do not just copy what the bot says. The scope of the four theories is kind of fuzzy, so stick with the definitions you’ve heard in class rather than the bot’s training set.
- Usually, a Turing machine just rewrites its input into an output. But sometimes, all we want to compute is the answer to a YES-NO question. In this case, how does the Turing Machine announce its answer?
It says YES by entering an accept state with no outgoing transitions for the current symbol, and answers NO by entering into a non-accept state with no outgoing transition for the current symbol. It is possible for the machine to loop forever on a given input, in which case it is said to give no answer.
- What is the difference between deciding and recognizing?
A machine decides a language if it always halts with a YES or NO answer. A machine recognizes a language if it always correctly answers YES for strings in the language, but it may or may not halt with a NO for non-members.
- Arrange R, FINITE, RE, LR, CS, CF, and REG in subset order.
- What is the language class BPP?
The set of languages that can be decided by a probabilistic TM in polynomial time such that the probability of its answer being correct is $\geq 2/3$.
- Arrange NP, EXPSPACE, EXPTIME, PSPACE in subset order.
- In a typical compiler, what does a parser produce?
A syntax tree. Some produce concrete syntax trees, others produce abstract syntax trees.
- Why are compilers split into a front end and a back end? Give the two most important reasons.
(1) It is hard to even think about translation without an intermediate conceptual representation. (2) Front ends are reusable for many targets, and back ends are reusable for many source languages.
- Critique the claim “Writing a transpiler means your compiler needs only a front end and not a backend.” There’s a kernel of truth to it, but it might not be wholly accurate.
You do need a backend to generate the target code from an intermediate representation. Some purists might claim that to be a “real backend” you have to generate some really great, optimized assembly language for a serious target machine, so the claim all hinges on what is meant by “backend.”
- Is the program tsc a compiler or a transpiler? Do we care?
Well, it does stand for “TypeScript compiler” so sure, it’s a compiler because it compiles TypeScript into JavaScript. But then again, JavaScript is a high level language so you can call it a transpiler too. Both terms work. Don’t be picky. Don’t be that person. It’s not worth getting worked up about.
- How can we make the negation operator and the exponentiation operator not associate with each other? Show a grammar fragment.
Put them on the same “level”:
Exp7 = "-" Exp8
| Exp8 "**" Exp7
| Exp8
- For each scenario below, give Ohm grammar rules for Exp, Term, Factor, and Primary that make the expression -2**2:
- Evaluate to $4$.
- Evaluate to $-4$.
- Be a syntax error, while allowing (-2)**2 and -(2**2) to be legal.
- Why do language designers put functions like sqrt into a standard library (as opposed to being wired into the language, or left to a “third party” library)?
It is so commonly used that most people want or expect it without having to import an external library, and yet, it is not common enough to warrant its own operator wired into the core syntax of a language.
- What is wrong with the Ohm rule
WhileStmt = "while" Exp Block
How do we fix this problem?
If the expression begins with a letter, Ohm would still match the while statement even if there were no spaces between the word “while” and the expression! To fix this, create a lexical category while = "while" ~idrest and redefine the while statement rule as WhileStmt = while Exp Block.
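The same keyword-boundary issue can be illustrated outside Ohm. The following is a rough analogue using Python's `re` module, not Ohm itself: the names `NAIVE` and `FIXED` are made up for this sketch, and the negative lookahead `(?!\w)` plays the role that `~idrest` plays in the Ohm fix.

```python
import re

# Naive pattern: the keyword is followed directly by an optional gap and an
# identifier, so "whilex" is read as the keyword `while` plus the name `x`.
NAIVE = re.compile(r"while\s*([A-Za-z_]\w*)")

# Guarded pattern: the keyword must NOT be followed by an identifier character,
# which is the regex counterpart of writing  "while" ~idrest  in Ohm.
FIXED = re.compile(r"while(?!\w)\s*([A-Za-z_]\w*)")

print(NAIVE.match("whilex") is not None)   # True: wrongly accepted
print(FIXED.match("whilex") is not None)   # False: rejected, as desired
print(FIXED.match("while x").group(1))     # x
```

With the guard in place, `whilex` is left to be matched as an ordinary identifier by some other rule.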
- How do we write a rule for JavaScript-style one-line comments in Ohm?
- The Ohm notation
Factor = "-" Primary -- negation
| Primary
is actually an abbreviation for two separate rules. Give those two rules.
Factor = Factor_negation
| Primary
Factor_negation = "-" Primary
- The Ohm rule
Exp = Term ("+" Term)*
fails to capture what aspect of the + operator in the syntax? Associativity.
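To see why the missing associativity matters, here is a small Python sketch. It assumes the parser hands back a flat operand list, and it uses subtraction instead of `+` purely because subtraction makes the two groupings give different results. The helper names `fold_left` and `fold_right` are hypothetical.

```python
from functools import reduce
import operator

def fold_left(op, operands):
    """Group as ((a op b) op c): what a left-associative rule would mean."""
    return reduce(op, operands)

def fold_right(op, operands):
    """Group as (a op (b op c)): what a right-associative rule would mean."""
    return reduce(lambda acc, x: op(x, acc), reversed(operands))

operands = [8, 3, 2]  # the flat list a rule like Term ("-" Term)* yields
print(fold_left(operator.sub, operands))   # 3, i.e. (8 - 3) - 2
print(fold_right(operator.sub, operands))  # 7, i.e. 8 - (3 - 2)
```

The grammar rule alone does not say which of the two folds is intended, so the choice ends up in the semantics code rather than in the syntax.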
- In Ohm, the construct A ~B matches an A that is not followed by a B. How do we match an A that is followed by a B (without consuming the B)?
- Can an Ohm grammar ever be ambiguous?
Yes and no. It depends. Certainly PEGs can’t be ambiguous due to prioritized choice over non-deterministic choice and their prohibition against left recursion; but Ohm allows left-recursion so you can write:
A = A "+" A --plus
| "a"
which looks ambiguous. Ohm actually makes the operator here be right associative, so given the fact that Ohm is an implementation and always parses strings exactly one way, then no, its grammars are not ambiguous; however, this isn’t really specified anywhere so maybe yeah, that grammar in some sense is ambiguous. See issues 55 and 56 for more information.
- Given categories E (for expression) and T (for term), give an Ohm grammar rule to make the operator • on terms be left-associative.
- Given categories E (for expression) and T (for term), give an Ohm grammar rule to make the operator • on terms be right-associative.
- Given categories E (for expression) and T (for term), give an Ohm grammar rule to make the operator • on terms be non-associative.
- Here is an attempt to remove the need for operator precedence levels in a language design. Does it work? Why or why not?
E = E binaryop "(" E ")"
| num
It does, but I’ll admit it’s hard to prove. Can someone help?
- The parser generator in the Ohm system is unlike most others, in that it is not based on context-free grammars. What theoretical language description mechanism does it use?
Parsing Expression Grammars, or PEGs.
- Why do many languages make relational operators non-associative?
Reasonable people can disagree on the meaning of a<b<c. Some people think it should be automatically expanded to a<b && b<c. Others think it should be (a<b)<c.
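Python happens to take the first reading, which makes the disagreement easy to demonstrate. The values below are arbitrary, chosen so that the two interpretations give different answers.

```python
a, b, c = 1, 3, 2

# Python expands a chained comparison to (a < b) and (b < c).
chained = a < b < c

# Forcing the other reading compares the boolean result of a < b with c;
# in Python, True behaves like the integer 1 in that comparison.
grouped = (a < b) < c

print(chained)  # False, because 3 < 2 fails
print(grouped)  # True, because True < 2
```

A language that makes relational operators non-associative sidesteps the question by rejecting `a < b < c` outright.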
- In a language with string interpolation, what do we usually call the literal portions of a string? What are the interpolated portions called in ESTree?
ESTree calls the literal portions quasis and the interpolated portions expressions.
- What is the difference between an expression and a statement?
Expressions produce values; statements don’t. Statements are executed only for their effect.
- In many languages, expressions can appear within statements, but not the other way around. In JavaScript, however, statements can appear within expressions. Give an example.
console.log(() => {if (true) return;})
- In JavaScript, the left hand side of an assignment is not “just a variable”. What do we call that construct?
A pattern.
- Why is division by zero not considered a dynamic semantic error in Java?
An exception is thrown in this case and can be caught, with the program proceeding normally. Throwing and catching exceptions is well-defined and certainly does not violate any language rules.
- In Java, the grammar allows x < y < z. So what exactly happens when the compiler encounters this code fragment (assuming all variables are in scope)?
The operator is left-associative, so x < y is checked first. That is either a type error or a perfectly valid comparison assigned the type boolean. A boolean can’t be an operand of < anyway, so the entire expression is always a type error, detectable at compile time! But it is NOT a syntax error.
- Type checking is often concerned less with whether two types are identical than with whether elements of one type $T_1$ can be assigned to an Lvalue constrained to be of type $T_2$. In what situations is this check made?
(1) variable initialization, (2) assignment, (3) passing arguments to parameters, (4) returning from a function.
- In the conditional expression x ? y : z of a typical statically-typed language, what type checking and inference rules would a compiler be required to enforce?
Checks: the type of x must be boolean and the types of y and z must be compatible. Inference: the type of the entire expression is the least general type compatible with both y and z.
- How does a semantic analyzer check the legality of mutually recursive functions?
In each block, it first analyzes the signature of each function declared in the block and adds the function name and signature to the block’s context. Then it analyzes the function bodies (on a second “pass” through the block).
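A minimal Python sketch of the two-pass idea follows. Everything here is an assumption made for illustration: a function is reduced to a hypothetical `(name, parameter_count, calls)` triple, and the only checks are "callee exists" and "argument count matches".

```python
def analyze_block(functions):
    """Two-pass check of a block of function declarations.

    `functions` is a list of (name, parameter_count, calls) triples, where
    `calls` lists the (callee_name, argument_count) pairs found in the body.
    Returns a list of error messages (empty when the block is legal).
    """
    context = {}
    errors = []

    # Pass 1: record every signature before looking at any body.
    for name, parameter_count, _ in functions:
        if name in context:
            errors.append(f"{name}: already declared")
        context[name] = parameter_count

    # Pass 2: analyze bodies; every function in the block is now visible.
    for name, _, calls in functions:
        for callee, argument_count in calls:
            if callee not in context:
                errors.append(f"{name}: call to undeclared function {callee}")
            elif context[callee] != argument_count:
                errors.append(f"{name}: wrong number of arguments to {callee}")
    return errors

# is_even calls is_odd before is_odd has been declared, and vice versa.
block = [
    ("is_even", 1, [("is_odd", 1)]),
    ("is_odd", 1, [("is_even", 1)]),
]
print(analyze_block(block))  # []
```

A single pass that analyzed each body as soon as it was declared would report `is_odd` as undeclared while checking `is_even`.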
Problems
Here are some problems that require some thinking, maybe a fair amount of research, and some actual work. They may involve writing little scripts, or making sketches. They aren’t exactly short-answer problems.
Some of the problems may refer to languages you have never heard of! If so, you can try solving the same problem with a language you are familiar with, or, better, look up the basics of the unfamiliar language so that you can take your best shot at the problem.
Formal Languages
The first three problems are courtesy of Phil Dorin.
- Let $L = \{ w \in \{0,1\}^* \mid w = w^R \}.$
- Is $\varepsilon \in L$?
- Is $101 \in L$?
- Is $101 \in L^2$?
- Is $1010 \in L^2$?
- Is $01101101110 \in L^*$?
- Let $L_1 = \{ w \in \{0,1\}^* \mid w \textrm{ has an even number of 0s and an odd number of 1s} \}$ and $L_2 = \{ w \in \{0,1\}^* \mid w = w^R \}$.
- Is $\varepsilon \in L_1L_2$?
- Is $10010110 \in L_1L_2$?
- Is $0010110101111 \in (L_1L_2)^*$?
- Is $L_1 \subseteq L_1L_2$?
- Is $L_2 \subseteq L_1L_2$?
- Is $L_1$ countably infinite? If so, prove via an appropriate bijection; if not, prove via a proof to the contrary.
- Is ${L_1}^*$ countably infinite? Prove or disprove.
- Let $L$ be the language denoted by the regular expression $0^*1 + 11(1 + 010)^*10$.
- Is $\varepsilon \in L$?
- Is $01 \in L$?
- Is $0001 \in L$?
- Is $0111 \in L$?
- Is $10 \in L$?
- Is $110100101111 \in L$?
- Is $1101001011110 \in L$?
- Is $111011101110 \in L^*$?
- Is $\varepsilon \in L^*$?
- Is $L$ countable?
- Given $L_1 = \{0, 011, 10\}$ and $L_2 = \{10, 1\}$, what are:
- $L_1 \cup L_2$
- $L_1 \cap L_2$
- $L_1L_2$
- ${L_2}^*$
Generative Grammars
- Give grammars for the following languages, all over $\{ 0, 1 \}$:
- Strings of length $\geq 2$
- Odd binary numerals
- Even binary numerals
- Binary numerals divisible by 3
- Binary numerals divisible by 4
- Signed binary numerals that are negative (in 2's complement form)
- Give grammars for the following languages, all over $\{ a, b \}$:
- Strings of length $\geq 2$
- Strings containing only $a$'s, except that the first character could be a $b$
- Strings containing at least 5 $a$'s
- Strings containing at least 5 consecutive $a$'s
- Strings not containing two consecutive $b$'s
- Strings whose 8th symbol from the right is a $b$
- Strings having twice as many $a$'s as $b$'s
- Strings having 3 times as many $a$'s as $b$'s
- Palindromes of even length
- Palindromes of odd length
- Palindromes of any length ($w = w^R$)
- Strings whose first and last halves are the same ($ww$)
- Strings with an odd number of $a$'s and an even number of $b$'s
- $a^nb^n$
- Give grammars for the following languages, where the alphabet is the smallest one that makes sense for the language description:
- $\{ a^ib^jc^i \mid j = 2i \}$
- $\{ a^ib^jc^i \mid j \leq i \}$
- $\{ a^ib^jc^i \mid j \geq i \}$
- $\{ a^ib^jc^id^k \mid i,j,k \geq 1 \wedge k \textrm{ is a multiple of 3} \}$
- $\{ a^ib^jc^k \mid i \neq j \vee j \neq k \}$
- $\{ a^nb^nc^n \mid n \geq 0 \}$
- $\{ a^n \mid n \textrm{ is a power of 2} \}$
- $\{ a^n \mid n \textrm{ is prime} \}$
- $\{ a^n \mid n \textrm{ is not prime} \}$
- $\{ w \in \{a,b,c\}^* \mid \#_a(w) = \#_b(w) = \#_c(w) \}$
We need the variables here because left hand sides of rules must always have at least one variable.
s = (x y z)* -- repeat xyz 0 or more times
x y = y x -- switch them up all possible ways
x z = z x
y x = x y
y z = z y
z x = x z
z y = y z
x = "a" -- erase the variables
y = "b"
z = "c"
- $\{ a^ib^jc^id^j \mid i,j \geq 0 \}$
s = (l r)? -- start with left part and right part
l = "a" l? x -- generate a's on the left, counting them with x's
r = "b" r "d" | y -- generate equal nums of b's and d's leaving one y in the middle
x "b" = "b" x -- move the x's to the right, in order to generate c's
x y = y "c" -- when the x hits the y, make a c for it
y = ε -- erase the y to finish it off
- $\{ a^ib^jc^k \mid 1 \leq i \leq j \leq k \}$
s = "a" x? "b" z "c" -- initial set up
x = "a" x "b" y -- if you generate an a, you must do b and c also
| x? "b"? y -- generate bc or c (keeping things increasing)
y "b" = "b" y -- move y's to the right
y z = z "c" -- when y hits the z, make a c
z = ε -- when the z is no longer needed, drop it
- Give grammars for the following languages:
- The empty language
- $\{ 0^i1^j2^k \mid i=j \vee j=k \}$
- $\{ w \in \{0,1\}^* \mid w \textrm{ does not contain the substring 000} \}$
- $\{ w \in \{a,b\}^* \mid w \textrm{ has twice as many $a$'s as $b$'s} \}$
- $\{ a^nb^na^nb^n \mid n \geq 0 \}$
- Here’s another look at the grammar for floating-point numerals (using single-letter variables for compactness):
$\begin{array}{l}
n \longrightarrow d^+ \; f? \; e? \\
f \longrightarrow \texttt{"."} \; d^+ \\
e \longrightarrow (\texttt{"E"} | \texttt{"e"})\; (\texttt{"+"} | \texttt{"-"})? \; d^+ \\
d \longrightarrow \texttt{"0"} .. \texttt{"9"} \\
\end{array}$
Give the $(V, \Sigma, R, S)$-definition of this grammar. (Note this means you will have to desugar the rules with |, ?, and +.)
- Give grammars for the languages:
- $\{ a^nb^nc^n\mid n \geq 0 \}$
- $\{ a^ib^jc^k \mid i=j \mathrm{\;or\;} j=k \}$
- $\{ ww \mid w \in \{a,b\}^* \}$
- The following is a failed attempt to write a grammar for the language $L = \{w \in \{a,b\}^* \,\mid\, w \mathrm{\;has\;exactly\;twice\;as\;many\;} a\mathrm{s\;as\;}b\mathrm{s}\}$:
$S → aab \mid aba \mid baa \mid aaSb \mid abSa \mid baSa \mid aSab \mid aSba \mid bSaa \mid SS$
- Prove that $aaabbbbaaaaa$ is not in the language generated by this grammar.
- Give a correct context-free grammar for $L$ (and don’t forget that the empty string belongs, too).
There’s a little backstory to this problem. I was given this problem while a student at UCLA in early 1988. The TA gave the incorrect answer above. I showed the problem and the TA’s solution to Phil Dorin, who didn’t think the solution was right and worked on and off for 10–15 years to prove it wrong. Finally he wrote the following message to his teacher, Sheila Greibach:
Among the reasons that I have wanted to write is that, many years ago, my colleague, Ray Toal, whom you know, passed along a set of problem solutions that he received while studying at UCLA. They were for the 181 course—his instructor at the time was a fellow named Gabriel Robins, which probably tells you how long ago this was!—and they contained an error that I had always meant to report to you. (It’s been so long now that it has probably been corrected, but I’ll sleep a lot better once I’ve sent this off.) Specifically, the problem was to give a cfg that generated the set of all strings over alphabet $\{a,b\}$ with exactly twice as many $a$s as $b$s, which, I believe, was also a problem in an earlier edition of Hopcroft and Ullman. In any event, he gave the following solution, which he attributed to Lui:
S → SS
S → aaSb
S → abSa
S → baSa
S → aSab
S → aSba
S → bSaa
S → aab
S → aba
S → baa
Now, I am going feel awfully much like an idiot if I am wrong about this, but... how does this grammar produce the string $aaabbbbaaaaa$ (that is, three $a$s, followed by four $b$s, followed by five $a$s)? I have managed to prove to myself that it simply can NOT produce this string, and I wonder if I should trouble you to look at it and let me know. (Technically, the grammar is also missing a rule for producing the empty string, which is also in the language, but that’s another matter.)
I do believe that a correct grammar is:
S → [empty string]
S → SS
S → aSaSb
S → aSbSa
S → bSaSa
I’ve also worked the problem from the other direction: I constructed a npda, converted it to a cfg, and simplified it (by removing useless symbols, etc.)—but the resulting grammar doesn’t look anything like the above ones, so this didn’t provide much new insight.
Prof. Greibach gave a nice reply, and as part of it managed to state almost nonchalantly the “obvious” proof (at least to her—in a single sentence!) of non membership of $aaabbbbaaaaa$:
It does indeed fail on the example you gave since the first rule applied could not be any of those starting S → a... or S → b... and S → SS cannot be used because the example is not the concatenation of 2 words in the language.
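The argument in the reply can also be checked mechanically. The sketch below is a brute-force membership test written for this page, not anything from the original correspondence; the function names `failed_derives` and `fixed_derives` are hypothetical. It tries every way a rule could produce a given string, recursing on shorter strings.

```python
from functools import lru_cache

# Failed grammar: S -> aab | aba | baa | aaSb | abSa | baSa | aSab | aSba | bSaa | SS
BASE = {"aab", "aba", "baa"}
WRAPS = [("aa", "b"), ("ab", "a"), ("ba", "a"), ("a", "ab"), ("a", "ba"), ("b", "aa")]

@lru_cache(maxsize=None)
def failed_derives(w):
    if w in BASE:
        return True
    for pre, suf in WRAPS:  # rules of the form S -> pre S suf
        if len(w) > len(pre) + len(suf) and w.startswith(pre) and w.endswith(suf):
            if failed_derives(w[len(pre):len(w) - len(suf)]):
                return True
    # S -> SS: no rule yields the empty string, so both parts are non-empty.
    return any(failed_derives(w[:k]) and failed_derives(w[k:]) for k in range(1, len(w)))

# Proposed grammar: S -> (empty) | SS | aSaSb | aSbSa | bSaSa
TRIPLES = [("a", "a", "b"), ("a", "b", "a"), ("b", "a", "a")]

@lru_cache(maxsize=None)
def fixed_derives(w):
    if w == "":
        return True
    for first, mid, last in TRIPLES:  # rules of the form S -> first S mid S last
        if len(w) >= 3 and w[0] == first and w[-1] == last:
            for k in range(1, len(w) - 1):
                if w[k] == mid and fixed_derives(w[1:k]) and fixed_derives(w[k + 1:-1]):
                    return True
    # S -> SS: splits with an empty part add nothing new, so skip them.
    return any(fixed_derives(w[:k]) and fixed_derives(w[k:]) for k in range(1, len(w)))

print(failed_derives("aaabbbbaaaaa"))  # False
print(fixed_derives("aaabbbbaaaaa"))   # True
```

This only tests individual strings; it does not prove that the proposed grammar generates exactly $L$.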
Turing Machines
- Give a Turing Machine for multiplying a binary number by 8.
- Give a Turing Machine for floor-dividing a binary number by 4.
- Give a Turing Machine for negating a signed binary number (i.e., producing its two’s complement).
- Give a Turing Machine for incrementing a binary number.
- Give a Turing machine that produces the string "1" if its input consists of all zeros, or the string "0" otherwise.
- Give a Turing machine that determines whether a signed binary number is negative, i.e., that recognizes $\{ w \in \{0,1\}^* \mid w \textrm{ is a negative signed binary number} \}$.
- Give a Turing machine that determines whether a binary number is divisible by 5, i.e., that recognizes $\{ w \in \{0,1\}^* \mid w \textrm{ mod } 5 = 0 \}$.
- Give a Turing Machine ($\Sigma = \{ A \ldots Z \}$) that erases its entire input and writes the message HELLO.
- Give a Turing Machine ($\Sigma = \{ a,b,c \}$) that appends its input to itself. For example, if your input was $abbca$ then the output would be $abbcaabbca$.
- (Submitted by Amanda Marques) Give a Turing machine ($\Sigma = \{ 0, 1 \}$) that determines if its input is a palindrome, i.e., that recognizes $\{ w \in \{0,1\}^* \mid w = w^R \}$.
- (Submitted by Amanda Marques) Give a Turing Machine ($\Sigma = \{ a,b,c \}$) that appends the reversal of its input to itself, thereby generating a palindrome. For example, if your input was $abbca$ then the output would be $abbcaacbba$.
- Give a Turing machine ($\Sigma = \{ 1 \}$) that determines whether its input is exactly 8 symbols long, i.e., that recognizes $\{ w \in \{1\}^* \mid |w| = 8 \}$.
- Give a Turing machine ($\Sigma = \{ 1 \}$) that determines whether its input is exactly 88 symbols long, i.e., that recognizes $\{ w \in \{1\}^* \mid |w| = 88 \}$.
- Give a Turing machine ($\Sigma = \{ 1 \}$) that determines whether the length of its input is a power of 2.
- (Submitted by Amanda Marques) Give a Turing machine ($\Sigma = \{ 0, 1 \}$) that determines if its input does not contain the substring 000.
- Give a Turing machine ($\Sigma = \{ a, b \}$) that determines if its input has the same number of occurrences of $a$'s as $b$'s, i.e., that recognizes $\{ w \in \{a,b\}^* \mid \#_a(w) = \#_b(w) \}$.
- Give a Turing machine that recognizes $\{ a^nb^n \mid n \geq 0 \}$.
- Give a Turing machine that recognizes $\{ a^nb^nc^n \mid n \geq 1 \}$.
- Give a Turing machine ($\Sigma = \{ 0, 1 \}$) that determines if its input contains at least three zeros (not necessarily contiguous).
- Give a Turing machine ($\Sigma = \{ a, b \}$) that determines if its input contains an even number of $b$'s.
EVEN,a,a,R,EVEN
EVEN,b,b,R,ODD
ODD,a,a,R,ODD
ODD,b,b,R,EVEN
EVEN,#,#,L,ACCEPT
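One way to try out a transition table like this is a tiny simulator. The Python sketch below makes several assumptions: each line is read as state, symbol read, symbol written, head move, next state; `#` is the blank; the machine starts in `EVEN` at the leftmost input cell; and it halts when no transition applies, accepting only if it is in `ACCEPT`. The function name `accepts` and the step limit are inventions for this sketch.

```python
BLANK = "#"

# (state, symbol read) -> (symbol written, head move, next state)
RULES = {
    ("EVEN", "a"): ("a", "R", "EVEN"),
    ("EVEN", "b"): ("b", "R", "ODD"),
    ("ODD", "a"): ("a", "R", "ODD"),
    ("ODD", "b"): ("b", "R", "EVEN"),
    ("EVEN", BLANK): (BLANK, "L", "ACCEPT"),
}

def accepts(word, rules=RULES, start="EVEN", accept="ACCEPT", max_steps=10_000):
    tape = dict(enumerate(word))  # sparse tape; missing cells are blank
    state, head = start, 0
    for _ in range(max_steps):
        key = (state, tape.get(head, BLANK))
        if key not in rules:      # no transition: the machine halts here
            return state == accept
        write, move, state = rules[key]
        tape[head] = write
        head += 1 if move == "R" else -1
    raise RuntimeError("step limit reached; the machine may be looping")

print(accepts("abba"))  # True: two b's
print(accepts("ab"))    # False: halts in ODD on the blank
print(accepts(""))      # True: zero b's
```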
- Give a Turing machine that determines, for two strings, whether the first is longer than the second, given the following setup. The input alphabet is $\Sigma = \{ a, b, • \}$ and the input will be in the form $w•x$ for strings $w$ and $x$. Your TM should recognize $\{ w•x \mid w,x \in \{ a, b \}^* \wedge |w| > |x| \}$.
- Give a Turing machine that determines, for two strings, whether the first is a substring of the second, given the following setup. The input alphabet is $\Sigma = \{ a, b, • \}$ and the input will be in the form $w•x$ for strings $w$ and $x$. Your TM should recognize $\{ w•x \mid w,x \in \{ a, b \}^* \wedge w \textrm{ is a substring of } x \}$.
- Give a Turing machine that computes the sum of two unary numbers, given the following setup. The input alphabet is $\Sigma = \{ 1, • \}$ and the input will be in the form $w•x$ for strings $w$ and $x$. Your TM should output $1^{i+j}$ when it sees the input $1^i•1^j$.
- Give Turing Machines that recognize the following languages. If any of the languages below are Type-3, you may (and are encouraged to) give an FA in lieu of a TM recognizer, if the FA is simpler.
- $\{w \in \{a,b\}^* \mid w \textrm{ ends with } abb\}$
- $\{ w \in \{a,b\}^* \mid \#_a(w) = \#_b(w) \}$ (same number of $a$'s as $b$'s)
- $\{w \in \{a,b\}^* \mid w \textrm{ alternates } a\textrm{'s and } b\textrm{'s} \}$
- $\{ a^nb^na^nb^n \mid n \geq 0 \}$
- Give Turing Machines that compute the following functions, where the input and output are binary numerals.
- $\lambda n. 2n + 2$
- one's complement
- The function described in Python as
lambda n: str(n)[1:-1]
Register Machines
- For the JavaScript/Python expression 5 * 3 - 1 ** 3,
- Show a 3AC program to evaluate this expression, leaving the result in $r_0$
- Show a 0AC (stack machine) program to evaluate this expression, leaving the result on the top of the stack.
- Give stack machine code for x = y * (2 + z).
load y
load 2
load z
add
mul
store x
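To convince yourself that this instruction sequence computes the right value, a toy interpreter helps. The Python sketch below assumes semantics that the answer leaves implicit: `load` pushes either a variable's value or an integer constant, `add` and `mul` pop two operands and push the result, and `store` pops into a variable. The function name `run` and the sample values for `y` and `z` are made up.

```python
def run(program, memory):
    """Execute a tiny stack-machine program against a dict of variables."""
    stack = []
    for line in program.strip().splitlines():
        op, *args = line.split()
        if op == "load":
            # Variable names are looked up; anything else is an integer constant.
            arg = args[0]
            stack.append(memory[arg] if arg in memory else int(arg))
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == "store":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

PROGRAM = """
load y
load 2
load z
add
mul
store x
"""

memory = run(PROGRAM, {"y": 3, "z": 4})
print(memory["x"])  # 18, i.e. 3 * (2 + 4)
```

Note that the operands are pushed in source order and the operators appear in postfix order, which is why no parentheses are needed in the machine code.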
Language Classification
- Characterize each of the following languages as either (a) regular, (b) context-free but not regular, (c) recursive but not context-free, (d) recursively enumerable but not recursive, or (e) not even recursively enumerable.
- $\{ a^ib^jc^k \mid i > j > k \}$
- $\{ a^ib^jc^k \mid i > j \wedge k \leq i-j \}$
- $\{ \langle M\rangle\cdot w \mid M \textrm{ accepts } w\}$
- $\{ G \mid G \textrm{ is context-free} \wedge L(G)=\varnothing \}$
- $\{ a,b \}^*\{b\}^+$
- $\{ \langle M\rangle \mid M \textrm{ does not halt }\}$
- $\{ w \mid w \textrm{ is a decimal numeral divisible by 7} \}$
- $\{ www \mid w \textrm{ is a string over the Unicode alphabet} \}$
Compilation in Practice
- Find, and link to, real-life examples of self-hosting and cross compilers.
- Suppose a new computer called the X1234 has just come out and it doesn’t have a Swift compiler. But you want to make a resident Swift compiler on that machine. Fortunately you have a resident Swift compiler that runs on a MIPS machine. Describe exactly how you can construct the desired resident Swift compiler for the X1234 using the one for the MIPS.
Syntax
- Suppose we added to EBNF the form $A\verb!^!B$ which denotes $A \mid ABA \mid ABABA \mid \ldots$. Such a form makes it convenient to write rules involving separators, such as
IDLIST → ID ^ ","
This form can also be used to model a construct representing one or more $A$s, rather than using $AA^*$ or $A^*A$. Show how to do this.
Let $\varepsilon$ represent the empty string. Then we can use $A\verb!^!\varepsilon$.
- Here are a few Ohm grammar rules from the Ada programming language:
Exp = Exp1 ("and" Exp1)* | Exp1 ("or" Exp1)*
Exp1 = Exp2 (relop Exp2)?
Exp2 = "-"? Exp3 (addop Exp3)*
Exp3 = Exp4 (mulop Exp4)*
Exp4 = Exp5 ("**" Exp5)? | "not" Exp5 | "abs" Exp5
comment = "--" (~"\n" any)*
- What can you say about the relative precedences of "and" and "or"?
- If possible, give an AST for the expression X and Y or Z. (Assume, of course, that an Exp5 can lead to identifiers and numbers, etc.) If this is not possible, prove that it is not possible.
- What are the associativities of the additive operators? The relational operators?
- Is the "not" operator right associative? Why or why not?
- Why do you think the negation operator was given a lower precedence than multiplication?
- Give an abstract syntax tree for the expression -8 * 5.
- Suppose the grammar were changed by dropping the negation from Exp2 and adding - Exp5 to Exp4. Give the abstract syntax tree for the expression -8 * 5 according to the new grammar.
- The official grammar of the C programming language has over a dozen levels of operator precedence defined within the grammar. Write this subset of C syntax using Ohm.
- Describe each of the following languages in both EBNF and Ohm:
- $\{w \in \{a,b,c\}^* \mid w \mathrm{\;has\;at\;most\;one\;occurrence\;of\;any\;symbol}\}$
- $\{a^mb^nc^{m+n} \mid m \geq 1 \wedge n \geq 1 \}$
- Palindromes over $\{a, b\}$
- $\{a^mb^n \mid m \geq n \}$
- Strings of parentheses, brackets and braces, all properly balanced and nested
- Semicolon terminated statements
- Comma separated expressions
- Strings over $\{a, b, c, d, e\}$ containing at most one occurrence of any symbol
- EBNF generally uses
- $A\:B$ to mean exactly one $A$ followed by exactly one $B$
- $A?$ to mean zero or one $A$
- $A^*$ to mean zero or more $A$s
- $A \mid B$ to mean either exactly one $A$ or exactly one $B$
Suppose I wanted to add a new one:
- $A_1 \# A_2 \# ... \# A_n$ to mean “a non-empty string in which each of the $A_i$s appears zero or one times, but in any order.”
Show how to write $A \# B \# C$ using only the conventional EBNF markup.
$A \mid B \mid C \mid AB \mid AC \mid BA \mid BC \mid CA \mid CB \mid ABC
\mid ACB \mid BAC \mid BCA \mid CAB \mid CBA$
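The alternatives in this answer are exactly the orderings of every non-empty subset of the operands, which is why the desugared form grows so quickly. A short Python sketch (the function name `any_order_alternatives` is hypothetical) enumerates them:

```python
from itertools import permutations

def any_order_alternatives(symbols):
    """List every ordering of every non-empty subset of `symbols`."""
    return [
        "".join(ordering)
        for size in range(1, len(symbols) + 1)
        for ordering in permutations(symbols, size)
    ]

alternatives = any_order_alternatives(["A", "B", "C"])
print(" | ".join(alternatives))
print(len(alternatives))  # 15 = 3 + 6 + 6
```

For four operands the same function yields 64 alternatives, which suggests why a dedicated notation might be attractive.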
- Suppose we are designing a language and wish that no identifier could be exactly three characters long and end with "oo" (or "oO" or "Oo" or "OO").
- Write a regex for alphanumeric strings beginning with a letter that are not three characters long ending case-insensitively with "oo".
- Give a (lexical) Ohm rule to define identifiers as any string of alphanumerics and underscores, beginning with a letter, that satisfies our wish.
- We’ve seen that one way to deal with ugly code in curly brace languages is to require blocks in compound statements; for example:
IfStmt = "if" "(" Exp ")" Block
("else" "if" "(" Exp ")" Block)*
("else" Block)?
Block = "{" STMT* "}"
What if we tried the same approach in a language with a syntax like Ruby (or Fortran or Modula — languages using a terminating end)? We might get a grammar like this:
IfStmt = "if" Exp "then" STMT+
("else" "if" Exp "then" STMT+)*
("else" STMT+)?
"end"
Is this grammar left recursive? Is it $LL(k)$? Why or why not? Is this bad?
- Is this grammar an $LL$ grammar?
A → B C
B → a | b?c?
C → c | BA
If this grammar is not $LL$, make one that is (that defines the same language of course). Give a set of syntax diagrams for the original grammar, and if another is needed, for the new grammar as well.
- Here’s an Ohm grammar:
S = A M
M = S?
A = "a" E | "b" A A
E = ("a" B | "b" A)?
B = "b" E | "a" B B
- Describe, in English, the language of this grammar.
- Draw a parse tree for the string "abaa".
- Prove or disprove: “This grammar is $LL(1)$.”
- Prove or disprove: “This grammar is ambiguous.”
- Here’s a grammar that’s trying to capture the usual expressions, terms, and factors, while considering assignment to be an expression.
$
\begin{array}{lcl}
\mathit{Exp} & \longrightarrow & \textit{id}\;\texttt{":="}\;\textit{Exp} \;|\; \mathit{Term}\;\mathit{TermTail}\\
\mathit{Term} & \longrightarrow & \mathit{Factor}\;\mathit{FactorTail}\\
\mathit{TermTail} & \longrightarrow & (\texttt{"+"}\;\mathit{Term}\;\mathit{TermTail})? \\
\mathit{FactorTail} & \longrightarrow & (\texttt{"*"}\;\mathit{Factor} \mathit{FactorTail})? \\
\mathit{Factor} & \longrightarrow & \texttt{"("}\;\mathit{Exp}\;\texttt{")"} \;|\; \textit{id}
\end{array}
$
- Prove that this grammar is not $LL(1)$.
Both alternatives for Exp expand to a string beginning with id.
- Rewrite it so that it is $LL(1)$.
- Rewrite the grammar as a PEG.
- Write the grammar using Ohm, using left-recursion.
- Astro is a really tiny language, so we'd like to make Astro++. This new language adds an if-statement, a while statement, a break statement, and relational operators. The while statement should start with the keyword while, followed by a test expression, followed by a block (a curly-brace delimited sequence of statements). The if-statement should start with the keyword if, followed by a test expression, then a block, then an optional else-part which is the keyword else followed by either a block or another if-statement. Neither the while statement nor the if statement should end with a semicolon. The break statement is only allowed to appear in a while statement’s block. The relational operators are the same as those in Python and are to be NON-associative. All of the relational operators are on the same precedence level, lower than all other operators. Give the syntax of Astro++ using Ohm. Hint: Check your grammar with the Ohm Editor so that you don’t needlessly throw away points.
Regular Expressions
- Write a function in the language of your choice that returns whether its input string is a three character alphanumeric string ending, case insensitively, in "oo". Do this by matching against a regular expression.
- Describe, in English, the languages expressed by these regular expressions:
[01]*(10111[01] | 11[01][01][01][01])[01]*
([bc]*a[bc]*a[bc]*)*
0*1 | 0*10
c*a[ac]*b[abc]*
- Write regular expressions that:
- Match octal constants in C
0[0-7]*
- Match hexadecimal numerals divisible by 8 (signed or unsigned!)
- Match strings that begin with unsigned 32-bit hexadecimal numerals divisible by 16
- Match entire strings that are sixteen-bit hexadecimal numerals (signed or unsigned!) divisible by 8
- Match entire strings that are unsigned binary numbers, of any size, divisible by 8
- Match floating point constants that are not allowed to have an empty fractional part and can have no more than three digits in the exponent part
- Match floating point constants that are allowed to have an empty fractional part and can have no more than four digits in the exponent part
- Match identifiers that are strings of letters, digits, and underscores, that begin with a letter, are not allowed to end with an underscore, and cannot contain two successive underscores anywhere in the text.
- Match non-empty words consisting of the letters a-z whose first and second halves are the same (i.e., in set notation: {ww | w ∈ {a..z}+})
- Match entire character strings that contain neither the substring "return" nor "retry"
- Match entire strings that must be made up of lowercase Basic Latin letters only and that contain neither the substring "exit" nor "exec"
- Match entire strings that contain neither the substring "exit" nor "exec"
- Match all words in a string (use \b for word boundaries) that are preceded by the word “the”.
- Match words containing two adjacent double-letters.
- Match strings of digits not preceded by a dash.
(?<!-|\d)\d+
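Lookbehind patterns are easy to get subtly wrong, so it is worth running one against a few cases. The Python sketch below uses the lookbehind with a `+` quantifier so that whole runs of digits are captured; the `\d` inside the lookbehind keeps a match from starting in the middle of a run that follows a dash. The sample text is made up.

```python
import re

# A run of digits whose first digit is preceded by neither a dash nor a digit.
DIGITS_NOT_AFTER_DASH = re.compile(r"(?<!-|\d)\d+")

sample = "a12 -34 5-6 78"
print(DIGITS_NOT_AFTER_DASH.findall(sample))  # ['12', '5', '78']
```

Here `34` and `6` are skipped because each directly follows a dash, and the `4` in `-34` is not matched on its own because it follows a digit.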
- Write JavaScript regular expressions for the following. Please take advantage of character classes, lookarounds, and backreferences where they apply.
- Canadian Postal Codes (make sure to prohibit D, F, I, O, Q, U)
- Legal Visa® Card Numbers, ignoring the Luhn checksums, i.e., accept 4 + 15 digits or 4 + 12 digits.
- Legal MasterCard® Numbers, ignoring the Luhn checksums, i.e., accept 51-55 + 14 digits or 2221-2720 + 12 digits.
- Strings of Basic Latin letters except those strings that are exactly three letters ending with two Latin letter o’s, of any case.
- Binary numerals divisible by 16.
- Decimal numerals in the range 8 through 32, inclusive.
- All strings of Unicode letters, except python, pycharm, or pyc.
- Floating point constants that are allowed to have an empty fractional part, but whose exponent part is required and can have no more than three digits in the exponent part
- Palindromes over the letters a, b, and c, of length 2, 3, 5, or 8
- Python string literals. Don't get too fancy here—just translate what you see in the Python Reference linked above into Ohm notation.
Language Features
- C does not allow structures (i.e., non-atomic objects) to be tested for equality. Ada does. Maybe the designers of C wanted to keep things simple. How exactly would equality operations for structures complicate a C compiler or the runtime system?
- If possible, write a program in Modula-3 that makes a variable point to itself. That is, for some designator
X, make it so that X^ = X. If this is not possible, state why it is not possible.
- If possible, write a program in Ada that makes a variable point to itself. That is, for some designator
X, make it so that X.all = X. If this is not possible, state why it is not possible.
- If possible, show how to make a Standard ML variable
x of type x such that x.x = x, or state why this is impossible.
- In C++ you can say
(x += 7) *= z but you can’t say this in C. Explain the reason why, using precise, technical terminology. See if this same phenomenon holds for conditional expressions, too. What other languages behave like C++ in this respect?
- Consider the
continue statement of C.
- What kind of static semantic checks are required for this statement?
- Give an example piece of C code that has a continue statement in it, and show the intermediate and target code for it.
- Some languages do not require the parameters to a subprogram call to be evaluated in any particular order. Is it possible that different evaluation orders can lead to different arguments being passed? If so, give an example to illustrate this point, and if not, prove that no such event could occur.
- Ada allows subprograms to be objects, as in the following code fragment:
type Real_To_Real is access function (Real) return Real;
type Foo is access procedure (Integer; in out Boolean);
Sine, Cosine: Real_To_Real;
P: Foo;
Q: Real_To_Real;
function Integrate (F: Real_To_Real; A, B: Real) return Real;
...
function Square (X: Real) return Real is
begin
return X * X;
end;
...
Put (Integrate(Square'Access, 3, 10));
Q := Cosine;
if Q(Pi) > X then ...

Describe the semantic rules relating to this facility in Ada, and how you would enforce them in a compiler.
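For readers who have not met Ada's access-to-subprogram types, the fragment above stores functions in variables and passes them as arguments. A rough Python analogue is sketched below. The midpoint-rule body of `integrate` and the `steps` parameter are assumptions made for illustration, since the Ada fragment does not show how `Integrate` is implemented, and none of Ada's compile-time checking is modeled here.

```python
from typing import Callable

RealToReal = Callable[[float], float]  # plays the role of Real_To_Real

def integrate(f: RealToReal, a: float, b: float, steps: int = 1000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def square(x: float) -> float:
    return x * x

q: RealToReal = square                # a function stored in a variable
print(integrate(square, 3.0, 10.0))   # close to (10**3 - 3**3) / 3
print(q(4.0))
```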
- It is a well-known irritation that Ada does not allow you to write array aggregates for zero- or one-element arrays, e.g.,
A := (3) gives a static semantic error when $A$ is a one-element array of Integer. Why is this so? Propose a (trivial) syntactic extension to Ada that would remove this irritation.
- In Ada, the declarations
X: Integer := X + 1;
Foo: Foo;
Bar: Real := Bar(Foo);

(where global declarations of X, Foo and Bar are visible) are all illegal, since a declaration of an identifier hides global declarations of the same name immediately at the point it appears in the text, but the identifier may not be used until its declaration is complete. Give an alternate interpretation under which these declarations would be legal and explain the advantages and disadvantages of it from both the programmer’s and the compiler writer’s perspectives.
- In C++ it is not permitted to have two functions that differ only in return type overload each other. In Ada it is allowed. What is the reason for this situation? Even though Ada does allow this flexibility in overloading, the compiler needs some sophistication. What exactly is involved? Be very precise in your explanation and illustrate it with code fragments.
- Some programming languages require that in order to have mutually recursive functions, the programmer first define the first function’s signature (name, return types, parameters and parameter types), then the entire second function, then the entire first function. For example, in C++:
int f(int x, char y);
void g(int x) {if (x < 0) f(2, 'c');}
int f(int x, char y) {g(randomInteger());}
In C++, when f is finally declared, the names of the formal parameters don’t have to be repeated exactly as they appeared in the incomplete specification. But in Ada they do. Explain why the Ada rule makes life much easier for the compiler writer.
- Many languages have a syntax rule
DESIGNATOR → DESIGNATOR "." ID
for specifying variables made up from a record and a field of the record. But sometimes it can have the additional interpretation that the DESIGNATOR to the left of the dot was the name of a (visible) subprogram and the ID was an object declared immediately inside that subprogram. Show how to rearchitect the entity class hierarchy to support this.
- An online troll suggested that JavaScript was really confusing because it uses square brackets for array expressions, instead of simple parentheses. "After all," this person says, "in English we don’t use square brackets much if ever, so it should have had regular parentheses." Can we change JavaScript to work this way, and in doing so, affect only array expressions, that is, not cause any ambiguities in existing JavaScript code not involving arrays? If so, show what the following would look like:
- The assignment of a four-element array expression to a variable
- The assignment of a one-element array expression to a variable
- The assignment of a zero-element array expression to a variable.
and explain why your solutions satisfy the restriction that the change only affects expressions with array expressions.
- How do JavaScript and Rust treat the following:
let x = 3;
let x = 3;
- Describe how the languages Java and Ruby differ in their interpretations of the meaning of the keyword
private. You can use an AI chatbot for help, but please trim down the long-winded explanations those tools are known for, and give a concise explanation that proves you truly understand the difference.
- Some languages do not require the parameters to a function call to be evaluated in any particular order. Is it possible that different evaluation orders can lead to different arguments being passed? If so, give an example to illustrate this point, and if not, prove that no such event could occur.
- Some languages do not have loops. Write a function, using tail recursion (and no loops) to compute the minimum value of an array or list in Python, C, JavaScript, and in either Go, Erlang, or Rust (your choice). Obviously these languages probably already have a min-value-in-array function in a standard library, but the purpose of this exercise is for you to demonstrate your understanding of tail recursion. Your solution must be in the classic functional programming style, that is, it must be stateless. Use parameters, not nonlocal variables, to accumulate values. Assume the array or list contains floating-point values.
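As a reminder of the style being asked for, here is a minimal sketch of accumulator-passing tail recursion applied to a different problem, summing a list. The function name `total` and its parameters are hypothetical. Note that CPython does not eliminate tail calls, so this shows the shape of the technique rather than a performance benefit.

```python
def total(values, index=0, acc=0.0):
    """Sum values tail-recursively; acc carries the running result."""
    if index == len(values):
        return acc  # base case: nothing left to add
    # The recursive call is the last action performed, so it is a tail call.
    # All state travels through the parameters; nothing nonlocal is updated.
    return total(values, index + 1, acc + values[index])

print(total([1.5, 2.5, 3.0]))  # 7.0
```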
- Your friend creates a little JavaScript function to implement a count down, like so:
function countDownFrom10() {
let i = 10;
function update() {
document.getElementById("t").innerHTML = i;
if (i-- > 0) setTimeout(update, 1000);
}
update();
}

Your other friend says “Yikes, you are updating a non-local variable! Here is a better way:”
function countDownFromTen() {
function update(i) {
document.getElementById("t").innerHTML = i;
if (i-- > 0) setTimeout(update(i), 1000);
}
update(10);
}

What does your second friend’s function do when called? Why does it fail? Your friend is on the right path though. Fix their code and explain why your fix works.
Abstract Syntax
- Draw the AST for the following JavaScript program, using the level of detail we used during class. You can use my JS AST Viewer to guide you and check your work, but remember, the drawing you need to produce for full credit will be far less verbose than the tool’s output. (Remember, the tool uses a third-party parser, esprima-next, that provides a complete ESTree-compliant AST, which is far more verbose than expected for hand-drawn ASTs.)
let [x, y] = [0, 0];
console.log(93.8 * 2 ** x + y);

- Draw the AST for the following JavaScript program.
import x from "x"
console.log(93.8 * {x} << x.r[z])

- Draw the AST for the following JavaScript program.
const x = x / {[x]: `${x}`}[x]("y")

- For the following JavaScript fragment (not a complete program), draw the AST.
class C {
f({a, b: c}) {return ([a,f]) => C}
}

- Draw a JavaScript abstract syntax tree for the following script.
let [x, y] = Array.repeat(10|-2, 2);
function f({x, y}, ...p) {
return q => `"Say ${y/p[0].x} today`;
}

- Show a Java abstract syntax tree for:
static protected synchronized long g(Object... m) {
for (int y : f(x)) {
x = p.data[0] * (3<< 7|- x---c);
}
}

- Draw the abstract syntax tree for the following C fragment:
for (int i = x-3; q<=4&m.z[r |- 4]&2-8*r>- 5/~x;) {
while (a) {
y;
2,y;
}
}

- Draw the abstract syntax tree for this C function declaration:
void f(int x,...) {
struct e {
double x;
struct e *c[10];
char* (*f)();
};
struct e p;
exit(p.c[1]->f()[6 |~ x+2 >> x]);
}

- Draw the abstract syntax tree for this C function declaration:
int abc() {
return x = 4&x---*&y.m[-9];
}

- Give an abstract syntax tree for the following Java code fragment:
if (x > 2 || !String.matches(f(x))) {
write(-3 * q);
} else if (! here || there) {
do {
while (close) tryHarder();
x = x >>> 3 & 2 * x;
} while (false);
q[4].g(6) = person.list[2];
} else {
throw up;
}

- Draw the abstract syntax tree for the following Java compilation unit. (Make sure it is fairly abstract):
package p;
class C implements A {
public static A x = new t[3];
Socket s () {
while (x - 6>p | e || q +- p) {
this.x[3] = !v+++t;
}
}
{System.out.println("ooh");}
}

- Draw the AST for the following C fragment:
(a = 3) >= m >= ! & 4 * ~ 6 || y %= 7 ^ 6 & p

Assembly and Machine Language
- Write an assembly language program that displays a multiplication table of size 12 × 12.
- Write in assembly language a translation of the following C function
double f(int x, double y) {
return 4 * x + y;
}

- Under what circumstances can you safely replace the x86 code fragment
je L6
jmp L4
L6:
with the single instruction jne L4?
- Show that the addressing modes immediate, absolute memory, and register indirect can be simulated by register and register-offset alone.
- Show the target code that is generated for the source statement
X := Y; where X and Y are both 32-bit integers that are one step down the static chain from the current subprogram, by a code generator which emits access code for the two values independently. Assume $X$ is at offset $-8$ and $Y$ is at offset $-12$. How many registers are used? Then generate code for this statement by hand, intelligently.
- Suppose the variable $A$ was declared in an Ada program with
type array (21..38) of String(1..10)

and happened to have offset $-42$ in the frame of the subprogram in which it was declared. Suppose further that the variable J was declared in the same subprogram and had offset $-26$.
- Show the target code that loads the value of
A(J-1) into register eax that would be generated naïvely. Do not forget to show the bounds checking!
- Show target code to load the value of
A(J-1) into register eax in which the "-1" computation is folded in to the computation of the base address of A. Note that the bounds checking code will look a little different than in part (a).
- Write an assembly language program that takes zero or more command line arguments, which should all be integers, and displays the average of the parameters to standard output.
- Occasionally a compiler may output a sequence such as
mov [ebp-8], eax
mov eax, [ebp-8]
The second instruction might be able to be removed. But whether we are able to remove this instruction is undecidable. Why, exactly?
- The x86 has an
enter instruction which automatically makes a display. Research this instruction. Suppose a Carlos program had the following structure (indentation determines nesting):
function f, parameters: [x,y], locals: [a]
function g, parameters: [c], locals: [p,q,r,s]
function h, parameters: [a], locals: []
function k, parameters: [], locals: [z]
- Show what the runtime stack looks like from the call sequence
f→g→k→g→h→h→f→k
- What does the generated assembly language look like when
trying to access the value of f.x from h?
- Which parts of the Carlos compiler need to be rewritten
to use this instruction?
- The ENTER instruction is rarely used because it is slow. Show how slow it is by doing the following. Prepare a table with four columns. The left column will be:
enter n, 0
enter n, 1
enter n, 2
enter n, 3
...
and so on. The second column will be the number of clock cycles required on a Pentium for the particular ENTER instruction. The third column will be code equivalent to the ENTER instruction. For example, ENTER n, 1 is equivalent to:
push ebp
mov ebp, esp
push ebp
sub esp, n
The fourth column will be the number of clocks for the code in column 3.
- Show x86 code for the expression
x / y > (3 * x) || z || x < 3
where the "||" operator is short-circuit, and the variables $x$, $y$, and $z$ are all integer variables. Put the value of the expression in eax. Write the best possible code you can for the Pentium 4 processor.
- Write an assembly language function to compute $\frac{\sin(\log(x))}{y-7}$ where $x$ and $y$ are two double (64-bit float) parameters. Use the x86 C calling convention. Also write a C program that calls the function and displays the result.
- Write an assembly language function to compute the log base a of b, where a and b are two double (64-bit float) parameters. Use the x86 C calling convention. Write a C program for the unit tester (with at least 10 assert statements).
- Write an assembly language function to compute $\frac{y}{\sin \log \mathrm{atan2}(y,x)}$ where $x$ and $y$ are two double (64-bit float) parameters. Use the x86 C calling convention.
- Write an x86 assembly language program that sets every third byte of the three megabyte section of memory starting at address $b$. Use the MMX registers.
- Write an assembly language function that returns the dot product of two single-precision floating point arrays using the XMM registers. Implement a unit tester in C.
- Write an x86 assembly language function that returns the sum of the reciprocals of all the elements in an array of doubles. Use the C calling convention (so the function accepts the array and a length).
- Write an assembly language version of the following, using an
LEA instruction for the 3n+1 computation:
int C(int n) {
int count = 0;
while (n != 1) {
n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
count++;
}
return count;
}
- Show both naive and optimized intermediate code (entity graph), and both naive and optimized assembly language for:
if (x % 4096 == 0) {printf("Don't say \66;\6f;\6f;!");}
Hint: you need strength reduction, too.
- One kind of strength reduction is replacing division by a power of two with an arithmetic right shift, for example
sar eax, 10 to divide by 1024
sar eax, 8 to divide by 256
This optimization is not safe. Explain why. Show how to make it safe, and explain both why your optimization works and why it is safe.
- Write an x86 assembly language function that takes in four doubles and returns the product of the largest and the smallest argument. Assume the function will be called from a C program built under gcc running on a Pentium II or above. Note that you need to respect the calling convention. Do not use conditional jumps in your code.
- Give highly optimized x86 code for the following:
for j := 5 to y do
y := j * 7 + c;
printInteger(y - 4);
end loop;
where y and c are local variables in the current procedure at offsets -12 and +16 respectively. (Remember that the range is evaluated only once, the whole loop is skipped on the empty range, etc.) Make sure you respect the overflow semantics! Identify any induction expressions and explain how you optimized them. Compare your hand-written code with that generated by a real compiler.
- Write an x86 assembly language function to return the product of its input (which must be a double) and 7.0, without using multiplication or loops. USE AT MOST 4 ADDITIONS. The return type is double. Assume the function will be called from a C program built under gcc.
- Write the following in assembly language (use the C calling convention). It is supposed to compute a*log10(b). Use the
fyl2x and fldl2t instructions.
double f(double a, double b);
- What does this code do? For what ranges of n does it make sense?
mov eax, n
shl eax, 23
add eax, 3f800000h
mov [esp-4], eax
fld dword [esp-4]
- Generate code for the following basic block:
y := x * 4 + z;
z := p * y;
y := z;
x := z / y << x;
Runtime Systems
- A naïve way to implement a runtime system for a language with exceptions is to place two return addresses in an activation record. Sketch a small Ada or C++ function that can throw a (possibly user-defined) exception, and a code fragment that calls the function. Give a stack frame layout with two return addresses: one is the normal return address and the other is the address of the handler in the caller. Show the assembly language for the caller and the function itself.
- Discuss advantages and disadvantages of a subprogram call implementation in which (a) the calling subprogram saves all registers and (b) the called subprogram saves all registers. Explain why the x86’s C calling convention is a nice compromise.
- In a language that supports recursion, there may be multiple activations of a subprogram on the dynamic chain, and hence stack allocations of frames are generally used. However, subprograms that do not themselves make calls need not use stack frames. More generally, any subprogram that can never appear twice on a dynamic chain does not require a stack frame. Describe how to compute the set of all such subprograms at compile time.
- What exactly must be the case for a subprogram to not need a static link in its stack frame? Think up as many cases as possible.
- In Ada, C, and C++ arrays and records (structs) can be allocated on the stack, not just on the heap. When making assignments of aggregates to variables, compilers usually generate code to deposit the values in temporary storage. Why is this necessary in general? After all, in
Weekdays := Day_Set(False, True, True, True, True, True, False);
we could construct the aggregate directly in the variable Weekdays. Give an example of an assignment statement that illustrates the necessity of constructing an aggregate in temporary storage (before copying to the target variable).
Errors
Classify the following as a syntax error, static semantic (contextual) error, or not a compile time error. In the case where code is given, assume all identifiers are declared, have the expected type, and are in scope. All items refer to the Java language.
x+++-y
x---+y
- incrementing a read-only variable
- code in class C accessing a private field from class D
- Using an uninitialized variable
- Dereferencing a null reference
null instanceof C
!!x
x > y > z
if (a instanceof Dog d) {...}
var s = """This is weird""";
switch = 200;
x = switch (e) {case 1->5; default->8;};
- Identify the following errors as syntactic, static semantic, or dynamic semantic (runtime). If no language is mentioned for a particular case, it probably does not matter. Assume either C or Ada and write your assumption.
- Redeclaration of an identifier.
- Unbalanced parentheses.
- Applying an operator to an element of the wrong type.
- Array index out of bounds (in C, in Ada, ...).
- Division by zero.
- Semicolon after a block in C.
- Wrong number of arguments supplied to a call.
- Assignment of a variable of type T to a variable of a subtype of T, where the value of the first variable is outside the range of the second (in Ada).
- An unwanted infinite loop.
- Dereference of a null pointer.
- Application of the "." to an identifier which is not a field of the
record.
- Use of an uninitialized variable.
- Which of the following expressions are legal in Java (assuming $x$ and $y$ are integer variables)? State why they are legal or why they are not.
x---y
x-----y
- Classify the following as a syntax error, semantic error, or not a compile time error at all. In the case where code is given, assume all identifiers are properly declared and in scope. All items refer to the Java language.
x+++-y
x---+y
- incrementing a read-only variable
- accessing a private field in another class
- Using an uninitialized variable
- Dereferencing a null reference
null instanceof C
!!x
- Classify the following as (a) lexical error, (b) syntax error, (c) static semantic error, (d) dynamic semantic error, or (e) no error.
- A function call with no matching signature in Java.
- A function call with no matching signature in C.
x < y < z in Carlos, where x and y are ints and z is a boolean.
x < y < z in C, where x and y and z are all ints.
3[a] in Carlos, where a is an array variable.
3[a] in C, where a is an array variable.
char x = '\a'; in Carlos.
char x = '\a'; in C.
- Value returning function without a return statement, in Carlos.
- Value returning function without a return statement, in C.
- Semicolon after a block, in Carlos.
- Semicolon after a block, in C.
- Classify each of the following, assuming a typical statically typed language, as a (a) lexical error, (b) syntax error, (c) static semantic error, (d) dynamic semantic error, or (e) no error.
- Invoking an array constructor that accepts a length and produces an empty array of that length, with a negative argument
- Semicolons instead of commas in identifier lists
- An identifier that is 33 characters long
- Applying a length operator or function to a read-only array variable
- Applying a length operator to a struct
- Applying the
sin standard function to an integer
- The expression
x < y < z where $x$ and $y$ are integers, and $z$ is a boolean.
- Having the wrong number of arguments in a struct’s constructor.
- Dynamic semantic
- Syntax
- No error
- No error
- Static semantic
- No error
- Syntax
- Static semantic
- Find as many linter errors as you can in this Java source code file (C.java):
import java.util.HashMap;
class C {
static final HashMap<String, Integer> m = new HashMap<String, Integer>();
static int zero() {
return 0;
}
public C() {
}
}

You can use SonarLint or FindBugs or FindSecBugs or PMD or whatever you prefer. You might even need to use a combination of tools because it is possible no tool finds them all. (Please note you are not expected to already know what all the issues are here. The idea is to practice with tools and have good discussions with teammates. Find as many as you can, and read and understand each problem that is reported to you so you learn (1) what kinds of potential bugs and security problems can exist even in compilable and runnable code, and (2) the kinds of things that a static analyzer can detect.)
Computability
- The reachability problem is to determine for a given instruction, whether or not it might be executed for some run of the program. To optimize a program for space, we need to solve the reachability problem and remove all unreachable instructions. Show that this is impossible by reducing the halting problem to the reachability problem.
- Show the language $\{ \langle M_1\rangle \langle M_2\rangle \mid L(M_1) = L(M_2) \}$ is undecidable, using a rigorous reduction argument.
Optimization
- Give three examples of how aliasing can occur (you can use examples from several different languages). How does aliasing make copy propagation difficult? When, if ever, can an algorithm determine that an entity cannot possibly be aliased?
- Optimize the following. Show your work (that is, show a few intermediate steps toward your final solution, recording the optimizations you performed). You can abbreviate CP=copy propagation, CF=constant folding, DCE=dead code elimination. You’ll want to use more than just these three techniques.
L1:
r0 := x
z := 6
r1 := 4 - r0
r2 := 3 >= r1
if r2 == 0 goto L2
r3 := y + 4
r4 := *r3
z := r4
L2:
- Write, by hand, a super-efficient Squid fragment for the following Carlos fragment:
struct s {int x; int y; string s;}
s a = new s {
codepoint(getChar()), codepoint(getChar()), getString()};
while (a.x++ < a.y) {print($a.s[1]);}
- Here’s some Hana code that prints the elements of an integer array separated by commas:
for (int i = 0; i < #a; i++) {
print($a[i]);
print(", ") if (i != #a-1);
}
With optimizations turned off, my compiler produces:
p0:
copy 0, i1
L0:
copy [i0-4], r0
less i1, r0, r1
jz r1, L1
assert_not_null i0
copy [i0-4], r2
assert_in_range i1, 0, r2
mul i1, 4, r3
add i0, r3, r4
copy [r4], r5
to_string r5, r6
param r6
call __print, 4
copy [i0-4], r7
sub r7, 1, r8
not_equal i1, r8, r9
jz r9, L2
param s0
call __print, 4
L2:
inc i1
jump L0
L1:
exit
s0:
[44, 32]
- Describe, in high-level terms, what each of the assert
tuples is doing. Are both of them necessary? Why or why not?
- Rewrite this code fragment showing what it would look like
without using the variable i (but rather stepping through the
array elements by incrementing an internal pointer). Note that this
problem does not require you to know anything about how optimizers
work. You are only being asked to show off your understanding of
Squid to come up with a super-efficient Squid tuple sequence for
a specific algorithm.
- Suppose we have a compiler in which the ASTs for short-circuit or-expressions were modeled as binary expressions, instead of as a single node with two or more disjuncts.
- Draw the AST for
x || y || z under this assumption.
- Write out the tuple sequence produced by a naive translation of this tree.
- Write out a more efficient sequence of tuples (Hint: only one temporary should be needed).
- How would an optimizer detect that the naive sequence is lousy (by looking only at the tuples)? What kind of transformations would an optimizer do (at the tuple level) to turn the lousy tuple sequence into the good one?
- Explain how treating these operators as $n$-ary rather than binary simplifies this issue a great deal. Use a tree grammar in your explanation.
Code Generation
- For the following C function:
int f(const int n) {
return n % 2 == 0 ? n / 2 : 3 * n + 1;
}
give both a highly optimized WebAssembly translation and a highly optimized x86-64 translation. You do not have to do this by hand; instead, use the Compiler Explorer, and set the optimization to -O3. Include comments in the translated code.
Little Languages
- Here is a cool little functional language:
PROGRAM → (DECL ';')* EXPR
DECL → 'val' ID '=' EXPR
| 'fun' ID '(' PARAMS? ')' '=' EXPR
EXPR → NUMLIT | ID | UOP EXPR | EXPR BOP EXPR
| EXPR '?' EXPR ':' EXPR | ID '(' ARGS? ')' | '(' EXPR ')'
PARAMS → ID (',' ID)*
ARGS → EXPR (',' EXPR)*
UOP → '-' | 'abs' | 'not'
BOP → '+' | '-' | '*' | '/' | 'mod' | 'and' | 'or' | '==' | '<'
- Why is this called a functional language?
- Is the grammar ambiguous? Why or why not?
- Give a hierarchy of entity classes for this language.
- Write a Greatest Common Divisor function in this language.
- Give three examples of syntax errors and three examples of static
semantic errors in this language. Make sure to write down
all your assumptions; I did not give you any semantics so you
will have to make up something reasonable.
- Here is a small expression language:
Exp → Exp Exp op | intlit
op → "+" | "-" | "*" | "/"
- What language is this?
- Is the grammar ambiguous? Why or why not?
- Is it $LL(k)$ for any $k$? If so, for which $k$? If not, why not?
- Give a class hierarchy of entities for this language.
- Give an attribute grammar for this language that can be used to evaluate expressions.
- Give an attribute grammar for this language that attaches a "nesting level" to each identifier. You will have to make a slight modification to the original grammar for this to make sense.
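To check answers to the evaluation question, it can help to have a reference evaluator. The sketch below is not an attribute grammar; it is a plain stack-based evaluator for strings derived from `Exp`. It makes two assumptions the grammar does not state: tokens are separated by whitespace, and `/` means integer (floor) division. The function name `evaluate` is hypothetical.

```python
def evaluate(source):
    """Evaluate a whitespace-separated string derived from Exp."""
    operators = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a // b,  # assumption: integer (floor) division
    }
    stack = []
    for token in source.split():
        if token in operators:
            if len(stack) < 2:
                raise ValueError(f"operator {token!r} needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append(operators[token](left, right))
        elif token.isascii() and token.isdigit():
            stack.append(int(token))
        else:
            raise ValueError(f"unexpected token {token!r}")
    if len(stack) != 1:
        raise ValueError("not a single complete expression")
    return stack[0]

print(evaluate("3 4 + 2 *"))  # 14
```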
- This little language looks like an abstraction of something you might see in a real programming language. What, exactly? And is the grammar $LL(k)$?
G -> (S s G)?
S -> V q e | i f E g | V x
V -> i | V d i | V a E a
E -> n | V
- Remove the left recursion from this grammar:
B -> (a|b)*A | bba*c
A -> Ac | d
- Consider a language for describing vector graphics. An example program in this language (formatted ugly to highlight the fact that line breaks do not matter) is:
down deg color 1 0 0 left
90 forward 4 color 0 0 1 [ left 90
forward 1.5 ] right 90 forward 1.5 up
This program draws the letter T with a red vertical line of size 4 units and topped with a 3 unit blue line. A program is a sequence of instructions. The instructions are:
| Instruction | Description |
| --- | --- |
| deg | switch to degree mode |
| rad | switch to radians mode |
| down | put the pen down so movements draw lines |
| up | pick the pen up so movements don't draw anything |
| left θ | turn counterclockwise by angle θ |
| right θ | turn clockwise by angle θ |
| forward n | draw a line by moving forward n units. |
| backward n | draw a line by moving backward n units. |
| color r g b | set color (r,g,b), values are floats in the range 0 to 1. |
| [ | save current state |
| ] | restore previously saved state |
Give an Ohm grammar for this. Also answer: is it even possible to give an unambiguous CFG for this language? Why or why not?
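The instruction table can be made concrete with a small interpreter. The sketch below is not a grammar and does not answer the question above; it only illustrates the semantics by computing the line segments a program draws. Several details are assumptions the description leaves open: tokens are separated by whitespace, the pen starts up at the origin facing east in radians mode with color black, and the saved state includes position, heading, pen, mode, and color. The names `State` and `run` are hypothetical.

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    x: float = 0.0
    y: float = 0.0
    heading: float = 0.0            # radians; 0 means east (assumed)
    pen_down: bool = False          # assumed initial pen position
    degrees: bool = False           # assumed initial angle mode
    color: tuple = (0.0, 0.0, 0.0)  # assumed initial color

def run(program):
    """Return the list of (start, end, color) segments the program draws."""
    tokens = program.split()
    pos = 0

    def number():
        nonlocal pos
        value = float(tokens[pos])
        pos += 1
        return value

    state, saved, segments = State(), [], []
    while pos < len(tokens):
        op = tokens[pos]
        pos += 1
        if op in ("deg", "rad"):
            state = replace(state, degrees=(op == "deg"))
        elif op in ("down", "up"):
            state = replace(state, pen_down=(op == "down"))
        elif op in ("left", "right"):
            angle = number()
            if state.degrees:
                angle = math.radians(angle)
            sign = 1 if op == "left" else -1  # left is counterclockwise
            state = replace(state, heading=state.heading + sign * angle)
        elif op in ("forward", "backward"):
            distance = number() * (1 if op == "forward" else -1)
            x = state.x + distance * math.cos(state.heading)
            y = state.y + distance * math.sin(state.heading)
            if state.pen_down:
                segments.append(((state.x, state.y), (x, y), state.color))
            state = replace(state, x=x, y=y)
        elif op == "color":
            state = replace(state, color=(number(), number(), number()))
        elif op == "[":
            saved.append(state)     # save current state
        elif op == "]":
            state = saved.pop()     # restore previously saved state
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return segments

letter_t = """down deg color 1 0 0 left
90 forward 4 color 0 0 1 [ left 90
forward 1.5 ] right 90 forward 1.5 up"""
for start, end, color in run(letter_t):
    print(start, end, color)
```

Running it on the example program yields three segments: the red stem, then the two blue halves of the crossbar, with coordinates subject to small floating-point rounding.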
- Here is a description of a language. Programs in this language are made up of a possibly empty sequence of function declarations, followed by a single expression. Each function declaration starts with the keyword
func followed by the function’s name (an identifier), then a parenthesized list of zero or more parameters (also identifiers) separated by commas, then the body, which is a sequence of one or more expressions separated (NOT terminated) by semicolons with the expression sequence terminated with the keyword end. Expressions can be numeric literals, string literals, identifiers, function calls, or can be made up of other expressions with the usual binary arithmetic operators (plus, minus, times, divide) and a unary prefix negation and a unary postfix factorial (!). There’s a conditional expression with the syntax y if x else z. Factorial has the highest precedence, followed by negation, the multiplicative operators, the additive operators, and finally the conditional. Parentheses are used, as in most other languages, to group subexpressions. Numeric literals are non-empty sequences of decimal digits with an optional fractional part and an optional exponent part. String literals are delimited with double quotes and support the escape sequences \', \", \n, \\, and \u{hhhhhh} where hhhhhh is a sequence of one-to-six hexadecimal digits. Identifiers are non-empty sequences of letters, decimal digits, underscores, at-signs, and dollar signs, beginning with a letter or at-sign, that are not also reserved words. Function calls are formed with an identifier followed by a comma-separated list of expressions bracketed by square brackets. Comments run from -- until the end of the line. Write a single example program that covers every aspect of this definition.
- For the language described in the previous exercise, write a complete syntactic description of this language in Ohm. (Hint: use the Ohm editor to check your work. Use your everything-program from the previous exercise as a positive test, but write a few negative tests, too, so that you can check that your grammar does not “match too much.” When grading, I will copy-paste your submitted solution into the Ohm editor with my own detailed test suite).
Programming Problems
Don’t worry if there are languages here you don’t know. Do some research. Learn something new today.
- In the notes on Theories of Computer Science we saw how to express the odd-number test in Lambda Calculus notation, Lisp, Python, JavaScript, Java, Ruby, Clojure, Swift, and Rust. Show how, in each of these notations or languages, to express a function to cube a number. (Research may be required, as some of these languages may be new to you.)