Using JavaCC

In practice, the scanning and parsing phases of a compiler are handled by code that is generated by a parser generator. One parser generator for Java is called JavaCC. Here's a tutorial.

Introduction

JavaCC is a lexer and parser generator for LL(k) grammars. You specify a language's lexical and syntactic description in a JJ file, then run javacc on the JJ file. You will get seven java files as output, including a lexer and a parser.

This page helps you get started using JavaCC. You still need to read the online documentation to do anything really useful with the tool though.

We'll look at three things you can do with JavaCC

  1. Do a simple syntax check only
  2. Make an actual interpreter
  3. Generate code

Getting Started

Make sure you have (or get) at least version 5.0.

The home page for JavaCC is http://javacc.java.net/.

Download it. Unzip it. You may want to make the javacc script accessible from your path.

A First Example: Syntax Checking

Let's start with the language of integer expressions with addition and multiplication, with addition having lower precedence. The typical LL(1) grammar for this is:

Microsyntax
    SKIP → [\x0d\x0a\x20\x09]
    NUM → [0-9]+
    TOKEN → [+*()] | NUM
Macrosyntax
    E → T ("+" T)*
    T → F ("*" F)*
    F → NUM | "(" E ")"

Make a file called Exp1.jj like so:

Exp1.jj
PARSER_BEGIN(SyntaxChecker)

public class SyntaxChecker {
    public static void main(String[] args) {
        try {
            new SyntaxChecker(new java.io.StringReader(args[0])).S();
            System.out.println("Syntax is okay");
        } catch (Throwable e) {
            // Catching Throwable is ugly but JavaCC throws Error objects!
            System.out.println("Syntax check failed: " + e.getMessage());
        }
    }
}

PARSER_END(SyntaxChecker)

SKIP:  { " " | "\t" | "\n" | "\r"                    }
TOKEN: { "(" | ")" | "+" | "*" | <NUM: (["0"-"9"])+> }

void S(): {} { E() <EOF>           }
void E(): {} { T() ("+" T())*      }
void T(): {} { F() ("*" F())*      }
void F(): {} { <NUM> | "(" E() ")" }

This is pretty much a minimal example. You must define a parser class between the markers PARSER_BEGIN and PARSER_END, and you must specify tokens in a TOKEN: clause, and parsing methods must be defined, one for each non-terminal in the grammar. The parsing functions look rather like the EBNF for a grammar: you'll just notice non-terminals look like function calls, tokens look like themselves, and you have the usual EBNF metasymbols like "|" (choice), "*" (Kleene star), and "+" (Kleene plus).

To process the JJ file, invoke

javacc Exp1.jj

and notice seven files are output. The one called SyntaxChecker.java should be of some interest; take a look at it! (Yeah, it's pretty ugly, but you didn't write it yourself, right?)

Some example runs:

$ javacc Exp1.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file Exp1.jj . . .
Parser generated successfully.
$ javac *.java
Note: SyntaxChecker.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$ java SyntaxChecker "100"
Syntax is okay
$ java SyntaxChecker "8 * 3 * (2 + 2345234)"
Syntax is okay
$ java SyntaxChecker "(((3)) * 12)"
Syntax is okay
$ java SyntaxChecker "this doesn't work"
Syntax check failed: Lexical error at line 1, column 1.  Encountered: "t" (116),
 after : ""
$ java SyntaxChecker "(x = 3) + 111"
Syntax check failed: Lexical error at line 1, column 2.  Encountered: "x" (120),
 after : ""
$ java SyntaxChecker "1 2"
Syntax check failed: Encountered "2" at line 1, column 3.
Was expecting one of:
    <EOF>
    "+" ...
    "*" ...

If you want to get fancy with lexing, you can add comments to the language. For example:

SKIP: {
  " "
| "\t"
| "\n"
| "\r"
| <"//" (~["\n","\r"])* ("\n"|"\r")>
}

For lexical analysis, JavaCC allows other sections besides TOKEN and SKIP. There are also ways to make "private" tokens, and write complex regular expressions. See the JavaCC documentation for details. Also see the mini-tutorial on the JavaCC site for tips on writing lexer specifications from which JavaCC can generate efficient code.

Left Recursion

JavaCC cannot parse grammars that have left-recursion. It just can't. Suppose you tried to use the non-LL(k) grammar for the language above:

Macrosyntax
    E -> T | E "+" T
    T -> F | T "*" F
    F -> NUM | "(" E ")"

If you changed your JJ file accordingly and ran JavaCC, you would see this:

$ javacc BadLeftRecursion.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file BadLeftRecursion.jj . . .
Error: Line 22, Column 1: Left recursion detected: "T... --> T..."
Error: Line 21, Column 1: Left recursion detected: "E... --> E..."
Detected 2 errors and 0 warnings.

You must remove left recursion before writing your grammar rules in JavaCC. The general approach is to replace rules of the form

A → a | Ax

with

A → a (x)*

Lookahead

Let's add identifiers and assignment expressions to our language:

E -> id ":=" E | T ("+" T)*
T -> F ("*" F)*
F -> NUM | ID | "(" E ")"

If you naively write JavaCC for this grammar, JavaCC will say something like

$ javacc NotEnoughLookahead.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file NotEnoughLookahead.jj . . .
Warning: Choice conflict involving two expansions at
         line 24, column 16 and line 24, column 32 respectively.
         A common prefix is: <ID>
         Consider using a lookahead of 2 for earlier expansion.
Parser generated with 0 errors and 1 warnings.

What this means is that when expanding an E and looking at an id, we wouldn't know if that id is starting an assignment or is just a variable, unless we examine not just the id, but also the following token. So the parser needs to look at the next two symbols.

Understand that "choice points" can appear at various places within a grammar. They obviously appear at "|" seprators, but also in ()*, ()+ and ()? constructs too.

Exp3.jj
PARSER_BEGIN(SyntaxChecker)

public class SyntaxChecker {
    public static void main(String[] args) {
        try {
            new SyntaxChecker(new java.io.StringReader(args[0])).S();
            System.out.println("Syntax is okay");
        } catch (Throwable e) {
            // Catching Throwable is ugly but JavaCC throws Error objects!
            System.out.println("Syntax check failed: " + e.getMessage());
        }
    }
}

PARSER_END(SyntaxChecker)

SKIP: { " " | "\t" | "\n" | "\r" }
TOKEN: {
    "(" | ")" | "+" | "*" | ":="
    | <NUM: (["0"-"9"])+> | <ID: (["a"-"z"])+>
}

void S(): {} { E() <EOF>                                   }
void E(): {} { LOOKAHEAD(2) <ID> ":=" E() | T() ("+" T())* }
void T(): {} { F() ("*" F())*                              }
void F(): {} { <NUM> | <ID> | "(" E() ")"                  }

Sometimes no number of lookahead tokens is sufficient for the parser. Then you'd need to use syntactic lookahead. For sophisticated, or bizarre, parsing, sometimes semantic lookahead is needed. See the JavaCC Lookahead Mini-Tutorial for details.

Writing An Interpreter

If the syntax checker above looked like it had a lot of extra markup in its parsing functions, that's because parsers are supposed to do stuff while parsing. Parsing functions can take in parameters, return results, and invoke blocks of arbitrary Java code. Here's how we can actually evaluate the expressions, or, as they say, write an interpreter.

Exp2.jj
PARSER_BEGIN(Evaluator)

public class Evaluator {
    public static void main(String[] args) throws Exception {
        int result = new Evaluator(new java.io.StringReader(args[0])).S();
        System.out.println(result);
    }
}

PARSER_END(Evaluator)

SKIP:  { " " | "\t" | "\n" | "\r"                    }
TOKEN: { "(" | ")" | "+" | "*" | <NUM: (["0"-"9"])+> }

int S(): {int sum;}
{
  sum=E() <EOF> {return sum;}
}

int E(): {int sum, x;}
{
  sum=T() ("+" x=T() {sum += x;} )* {return sum;}
}

int T(): {int sum, x;}
{
  sum=F() ("*" x=F() {sum *= x;} )* {return sum;}
}

int F(): {int x; Token n;}
{
  n=<NUM> {return Integer.parseInt(n.image);}
|
  "(" x=E() ")" {return x;}
}

Some example runs

$ javacc Exp2.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file Exp2.jj . . .
Parser generated successfully.
$ javac *.java
Note: Evaluator.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$ java Evaluator "34"
34
$ java Evaluator "34 * 2 + ((3 + 11))"
82

Generating an Abstract Syntax Tree

A quick and dirty way to do what most real parsers do, namely generate an abstract syntax tree, is to embed the tree classes in the JJ file, like so:

Exp4.jj
PARSER_BEGIN(Parser)

public class Parser {
    public static void main(String[] args) throws Exception {
        Exp result = new Parser(new java.io.StringReader(args[0])).S();
        System.out.println(result);
    }
}

// Classes defining the Abstract Syntax Tree
abstract class Exp {}
class Num extends Exp {
    int value;
    Num(int v) {value = v;}
    public String toString() {return value + "";}
}
class BinaryExp extends Exp {
    String op;
    Exp left, right;
    BinaryExp(String o, Exp l, Exp r) {op = o; left = l; right = r;}
    public String toString() {return "(" + op + " " + left + " " + right + ")";}
}

PARSER_END(Parser)

SKIP:  { " " | "\t" | "\n" | "\r"                    }
TOKEN: { "(" | ")" | "+" | "*" | <NUM: (["0"-"9"])+> }

Exp S(): {Exp e;}
{
  e=E() <EOF> {return e;}
}

Exp E(): {Exp e1; Exp e2;}
{
  e1=T() ("+" e2=T() {e1 = new BinaryExp("+", e1, e2);} )* {return e1;}
}

Exp T(): {Exp e1; Exp e2;}
{
  e1=F() ("*" e2=F() {e1 = new BinaryExp("*", e1, e2);} )* {return e1;}
}

Exp F(): {Exp e; Token n;}
{
  n=<NUM> {return new Num(Integer.parseInt(n.image));}
|
  "(" e=E() ")" {return e;}
}

The toString() method is convenient, though it might trick you into believing we made a code generator that targets LISP. But the tree is really there.

$ java Parser 54
54
$ java Parser 54+3
(+ 54 3)
$ java Parser "54 + 3 * 2 + 7"
(+ (+ 54 (* 3 2)) 7)
$ java Parser "54 + 3 * (2 + 7) * 33"
(+ 54 (* (* 3 (+ 2 7)) 33))

Integrating the Parser into a Compiler

In a real compiler, you don't dump a main method into the parser. You'd make a nice function called parse() perhaps, and call that from your own code. And you'd define the tree classes in their own files.

For example, here is the JavaCC specification for a parser for the Iki programming language: