This page describes a compiler for the language Carlos. The idea is to show how a medium-sized Java program is designed and implemented. It's on GitHub.
Before working with the compiler you should read and study the formal definition for the language Carlos.
The compiler is written in Java, so make sure your development box has a JDK (1.6 or higher), and Maven (version 3 or above). You should use a good IDE like Eclipse, too. I've done some stuff in the pom file to make it work nicely in Eclipse.
After you get the source code by cloning or forking the repo, just enter
$ mvn package
in the project root directory. You can then run a mini-Carlos IDE with:
$ java -cp target/carlos-1.0.jar edu.lmu.cs.xlg.carlos.Viewer
If you were writing this compiler from scratch, you would first come up with a list of things you need. This project requires:
We've set up this project the "right way" — the way that most Java projects in the world are arranged these days. The basic idea is that /src contains everything checked into version control, and /src/main contains everything bundled in the released product.
Here is a brief description of each of the directories in the project:
Start by browsing the pom file
Note how the pom contains, for the project:
Some code will be useful to all parts of the compiler. For example, we'd like to do some global error counting and logging, and have a consistent way to do internationalization. I have a little logger that can do these things:
Professional software is internationalized, so you'll want resource files. Here's the default resource file (with U.S. English):
There is one for Spanish:
For extra credit, I'll gladly accept versions of this file in other languages.
Carlos abstract syntax trees are made up of such things as programs, blocks, variables, types, functions, expressions, statements, and so on. The class hierarchy I came up with for Carlos is:
Class | Attributes |
---|---|
Entity | — |
Block | statements |
Program | — |
Declarable | name |
Variable | typename, initializer |
Type | — |
ArrayType | baseTypeName |
StructType | fields |
Function | returnTypeName, parameters, body |
Statement | — |
Declaration | declarable |
IncrementStatement | op, target |
CallStatement | functionName, args |
AssignmentStatement | left, right |
BreakStatement | — |
ReturnStatement | returnExpression |
IfStatement | cases |
WhileStatement | condition, body |
ClassicForStatement | typeName, index, initial, final, each, body |
Expression | — |
Literal | lexeme |
IntegerLiteral | value |
RealLiteral | value |
StringLiteral | vaue |
CharLiteral | value |
BooleanLiteral | value |
NullLiteral | — |
VariableExpression | — |
SimpleVariableReference | name |
CallExpression | functionName, args |
SubscriptedVariable | array, index |
DottedVariable | struct, fieldname |
ArrayAggregate | typename, args |
StructAggregate | typename, args |
EmptyArray | typename, bounds |
PrefixExpression | op, operand |
PostfixExpression | op, operand |
InfixExpression | op, left, right |
StructField | name, typename |
Case | condition, body |
We have a class for each entity. There is a field for
each syntactic attribute and the constructor fills in those fields.
The classes are in src/main/java/edu/lmu/cs/xlg/carlos/entities
.
The src/main/java
directory will be the base directory for Java compilations, so the
package name for the entity classes will be
edu.lmu.cs.xlg.carlos.entities
.
Here is an example
package edu.lmu.cs.xlg.carlos.entities; /** * A statement of the form "while e b". */ public class WhileStatement extends Statement { private Expression condition; private Block body; public WhileStatement(Expression condition, Block body) { this.condition = condition; this.body = body; } public Expression getCondition() { return condition; } public Block getBody() { return body; } }
The Carlos grammar is not LL(k) (the rules for TYPE and VAR have left recursion), so when using JavaCC you'll notice some little tweaks.
The pattern used for parsing functions in the compiler is the one in which all parsing functions return an entity. For example:
Program parseProgram(): {List<Statement> statements = new ArrayList<Statement>(); Statement s;} { ( s = parseStmt() {statements.add(s);} )+ <EOF> {return new Program(statements);} }
There are a couple places where you'll want to change the grammar: since increments and calls can be both expressions and statements, you'll want to split up parsing functions; for example, parseCallExpression and parseCallStatement.
My completed JJ file for Carlos is (please study it):
Semantic analysis modifies the abstract syntax tree by doing things like resolving names (such as type names, variable names and function names) so that nodes can now point to real types and real variables and real functions. It also does things like type checking and argument count checking, logging errors whenever it finds problems.
The way semantic analysis works is by defining analyze()
methods for each
entity class. These methods not only do semantic checking and "fill in semantic fields", but
they also (necessarily) invoke the analyze()
methods of the nodes they reference.
We end up doing a depth-first traversal through the AST, transforming it into a semantic graph.
The analyze methods all operate with an AnalysisContext
object. Here is how
the method is described in the root Entity
class:
/** * Performs semantic analysis on this entity, and (necessarily) on its descendants. */ public abstract void analyze(AnalysisContext context);
The AnalysisContext
object has four fields: (1) a logger to use, (2) the
current symbol table in effect, (3) the current function you are inside of, which is important
when you are checking return statements, and (4) whether or not you are in a loop, which is
important to check break statements. Here is a part of this class:
public class Entity { ... public static class AnalysisContext { private Log log; private SymbolTable table; private Function function; private boolean inLoop; public static AnalysisContext makeGlobalContext(Log log) ... public AnalysisContext withTable(SymbolTable table) ... public AnalysisContext withFunction(Function function) ... public AnalysisContext withInLoop(boolean inLoop) ... public Log getLog() ... public SymbolTable getTable() ... public Function getFunction() ... public boolean isInLoop() ... public Type lookupType(String name) ... public Variable lookupVariable(String name) ... public Function lookupFunction(String name, List<Expression> args) ... public void error(String errorKey, Object... arguments) ... } }
Here are some examples of analzye()
methods. Sometimes analysis is trivial,
you just analyze your children nodes. This one is from PrintStatement.java:
@Override public void analyze(AnalysisContext context) { for (Expression arg : args) { arg.analyze(context); } }
Here's the one from WhileStatement.java. Note it has to analyze both its condition and
its body, but it also checks that its condition has type boolean (using a helper method on the
class Expression
). Note that when we analyze the body of the while statement, we
pass in a new context object which is like the original one, but has its inLoop
property
set to true
.
@Override public void analyze(AnalysisContext context) { condition.analyze(context); condition.assertBoolean("while_statement_condition", context); body.analyze(context.withInLoop(true)); }
From Variable.java. When we did syntax analysis, we only found a type name. During
semantic analysis, we have to lookup the type that the name maps to. If lookupType
does not find a type, then an error will be logged in the context, and the method will return
Type.ARBITRARY
. This special type typechecks cleanly with everything else; it helps
to avoid spurious errors.
@Override public void analyze(AnalysisContext context) { type = context.lookupType(typename); // If an initializer is present, analyze it and check types. if (initializer != null) { initializer.analyze(context); initializer.assertAssignableTo(type, context, "variable_initialization"); } }
From AssignmentStatement.java:
@Override public void analyze(AnalysisContext context) { left.analyze(context); right.analyze(context); left.assertWritable(context); right.assertAssignableTo(left.type, context, "assignment"); }
From BreakStatement.java:
@Override public void analyze(AnalysisContext context) { if (!context.isInLoop()) { context.error("break_not_in_loop"); } }
And now for one a little more complex. From CallExpression.java:
@Override public void analyze(AnalysisContext context) { // Analyze all the arguments for (Expression a: args) { a.analyze(context); } // Find out which function we're calling function = context.lookupFunction(functionName, args); if (function == null) { // If we can't find the function, just forget it type = Type.ARBITRARY; return; } // Since called from expression, must have a return type if (function.getReturnType() == null) { context.error("void_function_in_expression", functionName); type = Type.ARBITRARY; } else { type = function.getReturnType(); } }
It's easier to do semantic analysis if you have a good symbol table class. Here's mine:
Remember that unit testing involves making a non-interactive tester that can run on its own and test for hours. It should never rely on humans to "look at the output to see if each test is okay." Instead write code that processes zillions of inputs and does a lot of assertions. All modern Java systems use the JUnit framework for testing. The Carlos compiler automatically runs junit because the POM is set up to do that.
This tester works by picking up all the files named *.carlos
in the src/test/resources
directory and running them through
the front end. Filenames beginning with "synerror" are expected to
have syntax errors. Filenames beginning with "semerror" are expected
to have static semantic errors. All other files are expected to pass
semantic analysis with no errors. The tester counts the number of tests
and the number of failures and gives you a report.
Note that the tester is really only a shell; the tests themselves are only thorough if you write hundreds of little Carlos apps.
During development, you will find yourself needing to "see" the syntactic and semantic structures output by the compiler. I wrote a very simple, but fairly nice, GUI tool for this:
This is in no way a substitute for the unit tests; it's just a neat learning tool.
To use the tool, start typing a Carlos program into the left pane, then hit one of the four buttons to see (1) its Abstract Syntax Tree, (2) its semantic graph, (3) its optimized semantic graph, and (4) its translation into JavaScript. On success, the output pane will be green.
When there are errors detected by the compiler, the output pane will be yellowish:If the compiler itself throws an exception (hey it's a free hackathon-quality app), then the output pane will be red. I don't have any recent screenshots for this case, but here's one that showed up during development:
There's a pretty straightforward translator that walks the semantic graph and outputs JavaScript on the fly:
I've included a handful of simple optimizations. Most of the entity classes have an
optimize()
method. The best way to study these is by browsing the code. But
here's one snippet, in case you are interested; it's from the while statement:
@Override public Statement optimize() { condition = condition.optimize(); body.optimize(); if (condition.isFalse()) { return null; } return this; }
There are two classes in the compiler with main
methods:
You can run the compiler from the command line by saying:
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler
though it would be easier perhaps to wrap this call in a shell
script, something called carlos.bat
on Windows, and
just plain carlos
everywhere else. For bash, your
script would look like this:
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler $@
and for Windows
java -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler %1 %2
The Carlos compiler accepts several command line arguments:
carlos (display help) carlos -syn myprogram (produces the syntax tree, to stdout) carlos -sem myprogram (produces a semantic graph, to stdout) carlos -opt myprogram (produces an optimized semantic graph, to stdout) carlos -js myprogram (produces JavaScript in an output file)
To get the compiler to use Spanish, you can invoke (for example)
java -Duser.language=es -cp carlos-1.0.jar edu.lmu.cs.xlg.carlos.Compiler -js myprogram
I'll leave to to you to figure out how to get your script to deal with that.
Here is a program with semantic errors (in the file bad.carlos):
int x; int x; x = y + z;
If we ask the compiler to check syntax only
which shows there are no syntax errors, but:
This one is okay (the file is hello.carlos):
print "Hello";
The compiler produces, in hello.carlos.js:
(function () { console.log("Hello"); }());
Look in the class edu.lmu.cs.xlg.carlos.Compiler — you'll see methods you can call to compile a Carlos program. If you would like to compile Carlos programs as part of a different application, just include carlos.jar and you're good to go.