Let’s write a complete compiler for the Bella Programming Language. If you’re not familiar with the language, browse the official definition now.
We’ll write the compiler in JavaScript. These notes assume you are somewhat acquainted with Node.js, NPM, Git, Ohm, and the Bella language.
A good way to start is to create a project on GitHub, asking GitHub to create the README, the license, and the .gitignore. Select “Node” for the .gitignore template: GitHub knows what to do.
Clone your project, flesh out the README, run npm init
, and put "type": "module"
in your package.json. If you like, build out the folder structure as follows:
. ├── .gitignore (made by GitHub) ├── LICENSE (made by GitHub) ├── README.md (made by GitHub) ├── .prettierrc.json (totally optional) ├── docs │ └── bellalogo.png ├── examples │ └── hello.bella ├── package.json (made by "npm init") ├── src │ ├── bella.js │ ├── compiler.js │ ├── bella.ohm │ ├── core.js │ ├── parser.js │ ├── analyzer.js │ ├── optimizer.js │ └── generator.js └── test ├── compiler.test.js ├── parser.test.js ├── analyzer.test.js ├── optimizer.test.js └── generator.test.js
Run npm install ohm-js graph-stringify
and npm install c8 mocha --save-dev
, as we will use Mocha for testing, c8 for coverage, Ohm for lexing and parsing, and graph-stringify to visualize our abstract syntax trees and semantic graphs. Replace the "test:"
portion in the scripts
section of package.json to be "test": "c8 node_modules/.bin/mocha"
or whatever works for you.
For our project architecture, we’ll have a core module that defines the language entities (including those in the standard library), a compiler module that includes a compile
function to run the phases in sequence, and a file for a command line interface. The phases of parsing, static analysis, optimization, and code generation each have their own modules. The parser is driven by the grammar, stored in its own Ohm file:
Need more precise instructions?Although we’re now just looking at this existing compiler, we will review how to make Nodejs as we go. The steps to get the README and the package.json created were:
- Create a new project at GitHub, entering the name and description, checking the Public radio button, checking the “Add a README file” box (important), checking “Add .gitigore” and choosing Node for the template (also important), and checking “Choose a license” and selecting MIT.
- Clone the repo.
npm init -y
npm install mocha --save-dev
npm install c8 --save-dev
- Edit package.json so that the
test
line reads"test": "c8 node_modules/.bin/mocha"
- Edit package.json to add a line near the top that says
"type": "module"
. (This does not happen automatically! Without this line you get ugly, buggy, old-fashioned, yucky JavaScript.)- Perform any other package.json cleanups you like (author, keywords, etc.)
- Create a .prettierrc.json file (optional)
- Make sure the Prettier plugin is installed in the editor you are using. Configure things so that you at least “format on save.” You are morally obligated to use a code formatter. Don’t format on your own. It’s not worth it.
- Create your logo and save it in a new docs folder.
- Make some edits to your README.file, enough to be interesting. Make sure your logo is visible at the top of this file. (READMEs are best with logos!)
- Git add, git commit, git push (you should know how to do this already...if not, ask! Then do a Git tutorial.)
Sure, it looks overwhelming the first, time, but once you do it a few times, it becomes second nature.
We’ll add the rest of the files in the notes below.
Here’s the Ohm grammar for Bella grammar, copied from the language definition page:
Bella { Program = Statement+ Statement = let id "=" Exp ";" -- vardec | function id Params "=" Exp ";" -- fundec | Exp7_id "=" Exp ";" -- assign | print Exp ";" -- print | while Exp Block -- while Params = "(" ListOf<id, ","> ")" Block = "{" Statement* "}" Exp = ("-" | "!") Exp7 -- unary | Exp1 "?" Exp1 ":" Exp -- ternary | Exp1 Exp1 = Exp1 "||" Exp2 -- binary | Exp2 Exp2 = Exp2 "&&" Exp3 -- binary | Exp3 Exp3 = Exp4 ("<="|"<"|"=="|"!="|">="|">") Exp4 -- binary | Exp4 Exp4 = Exp4 ("+" | "-") Exp5 -- binary | Exp5 Exp5 = Exp5 ("*" | "/" | "%") Exp6 -- binary | Exp6 Exp6 = Exp7 "**" Exp6 -- binary | Exp7 Exp7 = num | true | false | id "(" ListOf<Exp, ","> ")" -- call | id -- id | "(" Exp ")" -- parens let = "let" ~idchar function = "function" ~idchar while = "while" ~idchar true = "true" ~idchar false = "false" ~idchar print = "print" ~idchar keyword = let | function | while | true | false num = digit+ ("." digit+)? (("E" | "e") ("+" | "-")? digit+)? id = ~keyword letter idchar* idchar = letter | digit | "_" space += "//" (~"\n" any)* -- comment }
Let’s write the parser first. It’s fairly trivial, because Ohm does the work for us once we write the grammar. We do have to set things up just a little to read the grammar from a file and to invoke the match method on the grammar to produce a match result object:
// The parse() function uses Ohm to produce a match object for a given // source code program, using the grammar in the file bella.ohm. import * as fs from "node:fs" import * as ohm from "ohm-js" const grammar = ohm.grammar(fs.readFileSync("src/bella.ohm")) // Returns the Ohm match if successful, otherwise throws an error export default function parse(sourceCode) { const match = grammar.match(sourceCode) if (!match.succeeded()) throw new Error(match.message) return match }
We’ll return a match result object if the match was successful, and throw an error with the match’s message
property already set by Ohm. This message is very convenient; here’s an example of what might go wrong when asked to parse a program with a syntax error:
Error: Line 2, col 14:
1 | let x = 5;
> 2 | print x - 2 /;
^
3 | print 5;
Expected "(", a letter, "false", "true", or a digit
To test the parser, we have to both (1) ensure the parser accepts programs that are syntactically correct, and (2) for syntactically incorrect programs, ensure that an error is thrown and that the error message contains expected text. Here’s a quick test. Perhaps it could be a little better, but it’s a good start, and has complete coverage:
import assert from "node:assert/strict" import parse from "../src/parser.js" const syntaxChecks = [ ["all numeric literal forms", "print(8 * 89.123);"], ["complex expressions", "print(83 * ((((-((((13 / 21)))))))) + 1 - 0);"], ["all unary operators", "print (-3); print (!false);"], ["all binary operators", "print x && y || z * 1 / 2 ** 3 + 4 < 5;"], ["all arithmetic operators", "let x = (!3) * 2 + 4 - (-7.3) * 8 ** 13 / 1;"], ["all relational operators", "let x = 1<(2<=(3==(4!=(5 >= (6>7)))));"], ["all logical operators", "let x = true && false || (!false);"], ["the conditional operator", "print x ? y : z;"], ["end of program inside comment", "print(0); // yay"], ["comments with no text are ok", "print(1);//\nprint(0);//"], ["non-Latin letters in identifiers", "コンパイラ = 100;"], ] const syntaxErrors = [ ["non-letter in an identifier", "ab😭c = 2", /Line 1, col 3/], ["malformed number", "x= 2.", /Line 1, col 6/], ["missing semicolon", "x = 3 y = 1", /Line 1, col 7/], ["a missing right operand", "print(5 -", /Line 1, col 10/], ["a non-operator", "print(7 * ((2 _ 3)", /Line 1, col 15/], ["an expression starting with a )", "x = );", /Line 1, col 5/], ["a statement starting with expression", "x * 5;", /Line 1, col 3/], ["an illegal statement on line 2", "print(5);\nx * 5;", /Line 2, col 3/], ["a statement starting with a )", "print(5);\n) * 5", /Line 2, col 1/], ["an expression starting with a *", "x = * 71;", /Line 1, col 5/], ] describe("The parser", () => { for (const [scenario, source] of syntaxChecks) { it(`properly specifies ${scenario}`, () => { assert(parse(source).succeeded()) }) } for (const [scenario, source, errorMessagePattern] of syntaxErrors) { it(`does not permit ${scenario}`, () => { assert.throws(() => parse(source), errorMessagePattern) }) } })
A focus of compiler writing is understanding the language in terms of its major semantic abstractions, called entities. When we analyze a program, we construct a representation of it using a graph of entities. For example, this Bella program:
let x = sqrt(9); function f(x) = 3 * x; while (true) { x = 3; print 0 ? f(x) : 2; }
is represented with this entity graph:
By reading the Bella language description, we would come up with the set of needed entities:
Class | Properties | Notes |
---|---|---|
Program | statements: Statement[] | A program is a sequence of statements. |
VariableDeclaration | variable: Variable initializer: Expression | A variable declaration defines a new variable and supplies its initial value. |
Variable | name: String readOnly: Boolean | A variable is one of two entity kinds in Bella (the other is the function). Variables have a name and can be read-only or writable. All variables in Bella have a numeric type. Bella is dynamically-typed anyway, so no type constraint is needed. |
FunctionDeclaration | fun: Function params: Variable[] body: Statement[] | A function declaration defines a new function entity and supplies it with a body. The parameters can be stored with the declaration, rather than with the function directly, since Bella is simple enough to not need anything more than a parameter count to be stored with the function. |
Function | name: String parameterCount: Int | A function is one of two entity kinds in Bella (the other is the variable). Functions have a name. We also keep a parameter count with them because when called, the number of call arguments must equal the number of function parameters. In Bella, we don’t have to store parameter names or types; the count alone suffices. |
Assignment | target: Variable source: Expression | Assignment statements assign a source expression to a target variable. Nothing more, nothing less. |
WhileStatement | test: Expression body: Statement[] | A while statement has a test expression and a sequence of statements for a body. |
PrintStatement | argument: Expression | The Bella print statement prints a single expression. |
Call | callee: Function args: Expression[] | A function is called with zero or more expression arguments. |
Conditional | test: Expression consequent: Expression alternate: Expression | The conditional expression x?y:z features three constituent expressions. |
BinaryExpression | op: String left: Expression right: Expression | All expressions with a binary operator are represented by an object of this class. |
UnaryExpression | op: String operand: Expression | An expression with a unary operator and a single operand. |
A language core should also define objects for the actual variables and functions that are predefined (part of a “standard library”). To keep the compiler lightweight, we’ll use raw JavaScript objects to represent the program representation nodes. Each node will have a distinguished field called kind
that serves as the, well, kind of node that it is. (If you were using TypeScript, you could get much more rigorous and define classes for each of the node types.)
export function program(statements) { return { kind: "Program", statements } } export function variableDeclaration(variable, initializer) { return { kind: "VariableDeclaration", variable, initializer } } export function functionDeclaration(fun, params, body) { return { kind: "FunctionDeclaration", fun, params, body } } export function assignment(target, source) { return { kind: "Assignment", target, source } } export function whileStatement(test, body) { return { kind: "WhileStatement", test, body } } export function printStatement(argument) { return { kind: "PrintStatement", argument } } export function call(callee, args) { return { kind: "Call", callee, args } } export function conditional(test, consequent, alternate) { return { kind: "Conditional", test, consequent, alternate } } export function binary(op, left, right) { return { kind: "BinaryExpression", op, left, right } } export function unary(op, operand) { return { kind: "UnaryExpression", op, operand } } export function variable(name, readOnly) { return { kind: "Variable", name, readOnly } } export function fun(name, paramCount) { return { kind: "Function", name, paramCount } } export const standardLibrary = Object.freeze({ π: variable("π", true), sqrt: fun("sqrt", 1), sin: fun("sin", 1), cos: fun("cos", 1), exp: fun("exp", 1), ln: fun("ln", 1), hypot: fun("hypot", 2), })
Ohm’s match result is a representation of the concrete syntax tree. We’ll want to turn that into an abstract syntax tree using the core objects we’ve just seen. But while converting the CST into an AST, we’ll need to also perform a number of static contextual checks, that cannot be captured in the grammar, such as:
These checks are made in the static analysis (also known as the semantic analysis) phase of the compiler front-end, so its nicely handled by context objects. Context objects keep track of the local variables that have been defined in the current scope. Because Bella’s scopes are nested, each context object will hold a reference to its parent context.
The initial scope contains the standard library entities, and new scopes are created for functions.
Earlier, we saw how for failed matches, Ohm’s match result object contained a very convenient message property showing contextual information for the error. For errors discovered in our analyzer, we have to explicitly invoke the getLineAndColumnMessage
method on parse tree nodes. One way to do this is to make a special validation function through which all errors pass. Whenever our analyzer detects an error, it passes the parse tree node as a parameter to the validation function, and the validator packages up a nice error message and throws.
Error: Line 2, col 1:
1 | print(exp(π));
> 2 | π = 3;
^
3 |
The identifier π is read only
Here’s the analyzer:
// The semantic analyzer exports a single function, analyze(match), that // accepts a grammar match object (the CST) from Ohm and produces the // internal representation of the program (pretty close to what is usually // called the AST). This representation also includes entities from the // standard library, as needed. import * as core from "./core.js" class Context { constructor({ parent, locals = {} }) { this.parent = parent this.locals = new Map(Object.entries(locals)) } add(name, entity) { this.locals.set(name, entity) } lookup(name) { return this.locals.get(name) || this.parent?.lookup(name) } } export default function analyze(match) { // Track the context manually via a simple variable. The initial context // contains the mappings from the standard library. Add to this context // as necessary. When needing to descent into a new scope, create a new // context with the current context as its parent. When leaving a scope, // reset this variable to the parent context. let context = new Context({ locals: core.standardLibrary }) // The single gate for error checking. Pass in a condition that must be true. // Use errorLocation to give contextual information about the error that will // appear: this should be an object whose "at" property is a parse tree node. // Ohm's getLineAndColumnMessage will be used to prefix the error message. function must(condition, message, errorLocation) { if (!condition) { const prefix = errorLocation.at.source.getLineAndColumnMessage() throw new Error(`${prefix}${message}`) } } function mustNotAlreadyBeDeclared(name, at) { must(!context.locals.has(name), `Identifier ${name} already declared`, at) } function mustHaveBeenFound(entity, name, at) { must(entity, `Identifier ${name} not declared`, at) } function mustBeAVariable(entity, at) { // Bella has two kinds of entities: variables and functions. must(entity?.kind === "Variable", `Functions can not appear here`, at) } function mustBeAFunction(entity, at) { must(entity?.kind === "Function", `${entity.name} is not a function`, at) } function mustNotBeReadOnly(entity, at) { must(!entity.readOnly, `${entity.name} is read only`, at) } function mustHaveCorrectArgumentCount(argCount, paramCount, at) { const equalCount = argCount === paramCount must(equalCount, `${paramCount} argument(s) required but ${argCount} passed`, at) } const builder = match.matcher.grammar.createSemantics().addOperation("rep", { Program(statements) { return core.program(statements.children.map(s => s.rep())) }, Statement_vardec(_let, id, _eq, exp, _semicolon) { // Analyze the initializer *before* adding the variable to the context, // because we don't want the variable to come into scope until after // the declaration. That is, "let x=x;" should be an error (unless x // was already defined in an outer scope.) const initializer = exp.rep() const variable = core.variable(id.sourceString, false) mustNotAlreadyBeDeclared(id.sourceString, { at: id }) context.add(id.sourceString, variable) return core.variableDeclaration(variable, initializer) }, Statement_fundec(_fun, id, parameters, _equals, exp, _semicolon) { // Start by adding a new function object to this context. We won't // have the number of params yet; that will come later. But we have // to get the function in the context right way, to allow recursion. const fun = core.fun(id.sourceString) mustNotAlreadyBeDeclared(id.sourceString, { at: id }) context.add(id.sourceString, fun) // Add the params and body to the child context, updating the // function object with the parameter count once we have it. context = new Context({ parent: context }) const params = parameters.rep() fun.paramCount = params.length const body = exp.rep() context = context.parent // Now that the function object is created, we can make the declaration. return core.functionDeclaration(fun, params, body) }, Params(_open, idList, _close) { return idList.asIteration().children.map(id => { const param = core.variable(id.sourceString, true) // All of the parameters have to be unique mustNotAlreadyBeDeclared(id.sourceString, { at: id }) context.add(id.sourceString, param) return param }) }, Statement_assign(id, _eq, exp, _semicolon) { const target = id.rep() mustNotBeReadOnly(target, { at: id }) return core.assignment(target, exp.rep()) }, Statement_print(_print, exp, _semicolon) { return core.printStatement(exp.rep()) }, Statement_while(_while, exp, block) { return core.whileStatement(exp.rep(), block.rep()) }, Block(_open, statements, _close) { return statements.children.map(s => s.rep()) }, Exp_unary(op, exp) { return core.unary(op.sourceString, exp.rep()) }, Exp_ternary(exp1, _questionMark, exp2, _colon, exp3) { return core.conditional(exp1.rep(), exp2.rep(), exp3.rep()) }, Exp1_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp2_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp3_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp4_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp5_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp6_binary(exp1, op, exp2) { return core.binary(op.sourceString, exp1.rep(), exp2.rep()) }, Exp7_parens(_open, exp, _close) { return exp.rep() }, Exp7_call(id, _open, expList, _close) { // ids used in calls must have already been declared and must be // bound to function entities, not to variable entities. const callee = context.lookup(id.sourceString) mustHaveBeenFound(callee, id.sourceString, { at: id }) mustBeAFunction(callee, { at: id }) const args = expList.asIteration().children.map(arg => arg.rep()) mustHaveCorrectArgumentCount(args.length, callee.paramCount, { at: id }) return core.call(callee, args) }, Exp7_id(id) { // ids used in expressions must have been already declared and must // be bound to variable entities, not function entities. const entity = context.lookup(id.sourceString) mustHaveBeenFound(entity, id.sourceString, { at: id }) mustBeAVariable(entity, { at: id }) return entity }, true(_) { return true }, false(_) { return false }, num(_whole, _point, _fraction, _e, _sign, _exponent) { return Number(this.sourceString) }, }) return builder(match).rep() }
And here’s a test:
import assert from "node:assert/strict" import parse from "../src/parser.js" import analyze from "../src/analyzer.js" import * as core from "../src/core.js" const semanticChecks = [ ["variables can be printed", "let x = 1; print x;"], ["variables can be reassigned", "let x = 1; x = x * 5 / ((-3) + x);"], ["all predefined identifiers", "print ln(sqrt(sin(cos(hypot(π,1) + exp(5.5E2)))));"], ] const semanticErrors = [ ["using undeclared identifiers", "print(x);", /Identifier x not declared/], ["a variable used as function", "x = 1; x(2);", /Expected "="/], ["a function used as variable", "print(sin + 1);", /Functions can not appear here/], ["re-declared identifier", "let x = 1; let x = 2;", /Identifier x already declared/], ["an attempt to write a read-only var", "π = 3;", /π is read only/], ["too few arguments", "print(sin());", /1 argument\(s\) required but 0 passed/], ["too many arguments", "print(sin(5, 10));", /1 argument\(s\) required but 2 passed/], ] const sample = "let x=sqrt(9);function f(x)=3*x;while(true){x=3;print(0?f(x):2);}" describe("The analyzer", () => { for (const [scenario, source] of semanticChecks) { it(`recognizes ${scenario}`, () => { assert.ok(analyze(parse(source))) }) } for (const [scenario, source, errorMessagePattern] of semanticErrors) { it(`throws on ${scenario}`, () => { assert.throws(() => analyze(parse(source)), errorMessagePattern) }) } it(`produces the expected graph for the simple sample program`, () => { const program = analyze(parse(sample)) let x = core.variable("x", false) let f = core.fun("f", 1) let localX = core.variable("x", true) assert.deepEqual( program, core.program([ core.variableDeclaration(x, core.call(core.standardLibrary.sqrt, [9])), core.functionDeclaration(f, [localX], core.binary("*", 3, localX)), core.whileStatement(true, [ core.assignment(x, 3), core.printStatement(core.conditional(0, core.call(f, [x]), 2)), ]), ]) ) }) })
We were careful to check both statically valid and invalid code snippets, as well as checking that the analyzer could produce an expected semantic graph for a non-trivial program.
After analysis, we still have a faithful representation of the source language syntax, we have all the useful line and column numbers, and have representations of every function and every variable, whether used or not.
The next phase collapses things quite a bit. It gets rid of unreachable and dead code, and performs a bunch of algebraic simplifications where it can. For our Bella compiler, we have the following opportunities:
3 + 5
with 8)x = x;
)Run-time error handling?
Sometimes, compilers will keep the information about source code with line and column numbers around and actually bundle this into the translated code, so that any errors that occur at run time can be reported in great detail.
However, in this Bella compiler, we won’t do this. Bella is so static, that all error reporting has been done in the analyzer. We don’t ever have run-time errors.
Here’s the optimized version of our running example:
Because of the way the optimizer collapses the semantic graph, each optimization method returns a graph fragment. Here’s the module:
// The optimizer module exports a single function, optimize(node), to perform // machine-independent optimizations on the analyzed semantic representation. // // The only optimizations supported here are: // // - assignments to self (x = x) turn into no-ops // - constant folding // - some strength reductions (+0, -0, *0, *1, etc.) // - Conditionals with constant tests collapse into a single arm import { unary } from "./core.js" export default function optimize(node) { return optimizers?.[node.kind]?.(node) ?? node } const optimizers = { Program(p) { p.statements = p.statements.flatMap(optimize) return p }, VariableDeclaration(d) { d.initializer = optimize(d.initializer) return d }, FunctionDeclaration(d) { d.body = optimize(d.body) return d }, Assignment(s) { s.source = optimize(s.source) if (s.source === s.target) { return [] } return s }, WhileStatement(s) { s.test = optimize(s.test) s.body = s.body.flatMap(optimize) return s }, PrintStatement(s) { s.argument = optimize(s.argument) return s }, Call(c) { c.callee = optimize(c.callee) c.args = c.args.map(optimize) if (c.args.length === 1 && c.args[0].constructor === Number) { if (c.callee.name === "sqrt") return Math.sqrt(c.args[0]) if (c.callee.name === "sin") return Math.sin(c.args[0]) if (c.callee.name === "cos") return Math.cos(c.args[0]) if (c.callee.name === "ln") return Math.log(c.args[0]) if (c.callee.name === "exp") return Math.exp(c.args[0]) } return c }, Conditional(c) { c.test = optimize(c.test) c.consequent = optimize(c.consequent) c.alternate = optimize(c.alternate) if (c.test.constructor === Number || c.test.constructor === Boolean) { return c.test ? c.consequent : c.alternate } return c }, BinaryExpression(e) { e.left = optimize(e.left) e.right = optimize(e.right) if (e.left.constructor === Number) { if (e.right.constructor === Number) { if (e.op === "+") { return e.left + e.right } else if (e.op === "-") { return e.left - e.right } else if (e.op === "*") { return e.left * e.right } else if (e.op === "/") { return e.left / e.right } else if (e.op === "%") { return e.left % e.right } else if (e.op === "**" && !(e.left === 0 && e.right == 0)) { return e.left ** e.right } } else if (e.left === 0 && e.op === "+") { return e.right } else if (e.left === 1 && e.op === "*") { return e.right } else if (e.left === 0 && e.op === "-") { return unary("-", e.right) } else if (e.left === 0 && ["*", "/", "%"].includes(e.op)) { return 0 } else if (e.op === "**" && e.left === 1) { return 1 } } else if (e.right.constructor === Number) { if (["+", "-"].includes(e.op) && e.right === 0) { return e.left } else if (["*", "/"].includes(e.op) && e.right === 1) { return e.left } else if (e.op === "*" && e.right === 0) { return 0 } else if (e.op === "**" && e.left !== 0 && e.right === 0) { return 1 } } return e }, UnaryExpression(e) { e.operand = optimize(e.operand) if (e.operand.constructor === Number) { if (e.op === "-") { return -e.operand } } return e }, }
And the test:
import assert from "node:assert/strict" import parse from "../src/parser.js" import analyze from "../src/analyzer.js" import optimize from "../src/optimizer.js" import * as core from "../src/core.js" // Make some test cases easier to read const x = core.variable("x", false) const neg = x => core.unary("-", x) const power = (x, y) => core.binary("**", x, y) const cond = (x, y, z) => core.conditional(x, y, z) const sqrt = core.standardLibrary.sqrt const call = (f, args) => core.call(f, args) const letXEq1 = core.variableDeclaration(x, 1) const print = e => core.printStatement(e) const parameterless = name => core.fun(name, 0) const program = p => analyze(parse(p)) const expression = e => program(`let x=1; print ${e};`).statements[1].argument const tests = [ ["folds +", expression("5 + 8"), 13], ["folds -", expression("5 - 8"), -3], ["folds *", expression("5 * 8"), 40], ["folds /", expression("5 / 8"), 0.625], ["folds %", expression("17 % 5"), 2], ["folds **", expression("5 ** 8"), 390625], ["optimizes +0", expression("x + 0"), x], ["optimizes -0", expression("x - 0"), x], ["optimizes *1", expression("x * 1"), x], ["optimizes /1", expression("x / 1"), x], ["optimizes *0", expression("x * 0"), 0], ["optimizes 0*", expression("0 * x"), 0], ["optimizes 0/", expression("0 / x"), 0], ["optimizes 0+", expression("0 + x"), x], ["optimizes 0-", expression("0 - x"), neg(x)], ["optimizes 1*", expression("1 * x"), x], ["folds negation", expression("- 8"), -8], ["optimizes 1**", expression("1 ** x"), 1], ["optimizes **0", expression("x ** 0"), 1], ["optimizes sqrt", expression("sqrt(16)"), 4], ["optimizes sin", expression("sin(0)"), 0], ["optimizes cos", expression("cos(0)"), 1], ["optimizes exp", expression("exp(1)"), Math.E], ["optimizes ln", expression("ln(2)"), Math.LN2], ["optimizes deeply", expression("8 * (-5) + 2 ** 3"), -32], ["optimizes arguments", expression("sqrt(20 + 61)"), 9], ["optimizes true conditionals", expression("1?3:5"), 3], ["optimizes false conditionals", expression("0?3:5"), 5], ["leaves nonoptimizable binaries alone", expression("x ** 5"), power(x, 5)], ["leaves 0**0 alone", expression("0 ** 0"), power(0, 0)], ["leaves nonoptimizable conditionals alone", expression("x?x:2"), cond(x, x, 2)], ["leaves nonoptimizable calls alone", expression("sqrt(x)"), call(sqrt, [x])], ["leaves nonoptimizable negations alone", expression("-x"), neg(x)], [ "optimizes in function body", program("function f() = 1+1;"), core.program([core.functionDeclaration(parameterless("f"), [], 2)]), ], ["removes x=x", program("let x=1; x=x; print(x);"), core.program([letXEq1, print(x)])], [ "optimizes while test", program("while sqrt(25) {}"), core.program([core.whileStatement(5, [])]), ], ] describe("The optimizer", () => { for (const [scenario, before, after] of tests) { it(`${scenario}`, () => { assert.deepEqual(optimize(before), after) }) } })
Our Bella compiler is really a transpiler. We just output JavaScript. Our generator is pretty simple; it walks through the optimized graph emitting JavaScript on the fly. Not too much is interesting, other than perhaps the “decorating” of Bella names by adding numeric suffixes.
// The code generator exports a single function, generate(program), which // accepts a program representation and returns the JavaScript translation // as a string. import { standardLibrary } from "./core.js" export default function generate(program) { // When generating code for statements, we'll accumulate the lines of // the target code here. When we finish generating, we'll join the lines // with newlines and return the result. const output = [] // Variable names in JS will be suffixed with _1, _2, _3, etc. This is // because "for", for example, is a legal variable name in Bella, but // not in JS. So we want to generate something like "for_1". We handle // this by mapping each variable declaration to its suffix. const targetName = (mapping => { return entity => { if (!mapping.has(entity)) { mapping.set(entity, mapping.size + 1) } return `${entity.name}_${mapping.get(entity)}` } })(new Map()) const gen = node => generators?.[node?.kind]?.(node) ?? node const generators = { // Key idea: when generating an expression, just return the JS string; when // generating a statement, write lines of translated JS to the output array. Program(p) { p.statements.forEach(gen) }, VariableDeclaration(d) { output.push(`let ${targetName(d.variable)} = ${gen(d.initializer)};`) }, Variable(v) { if (v === standardLibrary.π) return "Math.PI" return targetName(v) }, FunctionDeclaration(d) { const params = d.params.map(targetName).join(", ") output.push(`function ${targetName(d.fun)}(${params}) {`) output.push(`return ${gen(d.body)};`) output.push("}") }, Function(f) { const standard = new Map([ [standardLibrary.sqrt, "Math.sqrt"], [standardLibrary.sin, "Math.sin"], [standardLibrary.cos, "Math.cos"], [standardLibrary.exp, "Math.exp"], [standardLibrary.ln, "Math.log"], [standardLibrary.hypot, "Math.hypot"], ]).get(f) return standard ?? targetName(f) }, PrintStatement(s) { const argument = gen(s.argument) output.push(`console.log(${argument});`) }, Assignment(s) { output.push(`${targetName(s.target)} = ${gen(s.source)};`) }, WhileStatement(s) { output.push(`while (${gen(s.test)}) {`) s.body.forEach(gen) output.push("}") }, Call(c) { const args = c.args.map(gen) const callee = gen(c.callee) return `${callee}(${args.join(",")})` }, Conditional(e) { return `((${gen(e.test)}) ? (${gen(e.consequent)}) : (${gen(e.alternate)}))` }, BinaryExpression(e) { return `(${gen(e.left)} ${e.op} ${gen(e.right)})` }, UnaryExpression(e) { return `${e.op}(${gen(e.operand)})` }, } gen(program) return output.join("\n") }
And the test:
import assert from "node:assert/strict" import parse from "../src/parser.js" import analyze from "../src/analyzer.js" import optimize from "../src/optimizer.js" import generate from "../src/generator.js" function dedent(s) { return `${s}`.replace(/(?<=\n)\s+/g, "").trim() } const fixtures = [ { name: "small", source: ` let x = 3 * 7; let y = true; y = 5 ** (-x) / 100 > (- x) || false; print((y && y) || (false || (x*2) != 5)); `, expected: dedent` let x_1 = 21; let y_2 = true; y_2 = ((((5 ** -(x_1)) / 100) > -(x_1)) || false); console.log(((y_2 && y_2) || (false || ((x_1 * 2) != 5)))); `, }, { name: "while", source: ` let x = 0; while x < 5 { let y = 0; while y < 5 { print x * y; y = y + 1; } x = x + 1; } `, expected: dedent` let x_1 = 0; while ((x_1 < 5)) { let y_2 = 0; while ((y_2 < 5)) { console.log((x_1 * y_2)); y_2 = (y_2 + 1); } x_1 = (x_1 + 1); } `, }, { name: "functions", source: ` function gcd(x, y) = y == 0 ? x : gcd(y, x % y); print gcd(22, 33); `, expected: dedent` function gcd_3(x_1, y_2) { return (((y_2 == 0)) ? (x_1) : (gcd_3(y_2,(x_1 % y_2)))); } console.log(gcd_3(22,33)); `, }, { name: "standard library", source: ` let x = π; print(sin(x) - cos(x) + exp(x) * ln(x) / hypot(2.3, x)); `, expected: dedent` let x_1 = Math.PI; console.log(((Math.sin(x_1) - Math.cos(x_1)) + ((Math.exp(x_1) * Math.log(x_1)) / Math.hypot(2.3,x_1)))); `, }, ] describe("The code generator", () => { for (const fixture of fixtures) { it(`produces expected js output for the ${fixture.name} program`, () => { const actual = generate(optimize(analyze(parse(fixture.source)))) assert.deepEqual(actual, fixture.expected) }) } })
Here is a module with a function to chain the phases together. It accepts a second parameter to allow you to stop after any phase, in case you want to inspect what the compiler does up to a given point.
import parse from "./parser.js" import analyze from "./analyzer.js" import optimize from "./optimizer.js" import generate from "./generator.js" export default function compile(source, outputType) { if (!["parsed", "analyzed", "optimized", "js"].includes(outputType)) { throw new Error("Unknown output type") } const match = parse(source) if (outputType === "parsed") return "Syntax is ok" const analyzed = analyze(match) if (outputType === "analyzed") return analyzed const optimized = optimize(analyzed) if (outputType === "optimized") return optimized return generate(optimized) }
There’s a light test for this module to make sure the output types are handled:
import assert from "node:assert/strict" import compile from "../src/compiler.js" const sampleProgram = "print(0);" describe("The compiler", () => { it("throws when the output type is missing", done => { assert.throws(() => compile(sampleProgram), /Unknown output type/) done() }) it("throws when the output type is unknown", done => { assert.throws(() => compile(sampleProgram, "no such type"), /Unknown output type/) done() }) it("accepts the parsed option", done => { const compiled = compile(sampleProgram, "parsed") assert(compiled.startsWith("Syntax is ok")) done() }) it("accepts the analyzed option", done => { const compiled = compile(sampleProgram, "analyzed") assert(compiled.kind === "Program") done() }) it("accepts the optimized option", done => { const compiled = compile(sampleProgram, "optimized") assert(compiled.kind === "Program") done() }) it("generates js code when given the js option", done => { const compiled = compile(sampleProgram, "js") assert(compiled.startsWith("console.log(0)")) done() }) })
Here’s a command line script that lets you compile from a file:
#! /usr/bin/env node import * as fs from "node:fs/promises" import stringify from "graph-stringify" import compile from "./compiler.js" const help = `Bella compiler Syntax: bella <filename> <outputType> Prints to stdout according to <outputType>, which must be one of: parsed a message that the program was matched ok by the grammar analyzed the statically analyzed representation optimized the optimized semantically analyzed representation js the translation to JavaScript ` async function compileFromFile(filename, outputType) { try { const buffer = await fs.readFile(filename) const compiled = compile(buffer.toString(), outputType) console.log(stringify(compiled, "kind") || compiled) } catch (e) { console.error(`\u001b[31m${e}\u001b[39m`) process.exitCode = 1 } } if (process.argv.length !== 4) { console.log(help) } else { compileFromFile(process.argv[2], process.argv[3]) }
The graph-stringify module gives us a compact look at the semantic graph, which is convenient for educational, or perhaps just debugging purposes. Let’s see it in action with some example runs. Here’s a sample program:
$ cat examples/small.bella let x=sqrt(9); function f(x) = 3 * x; while(true) { x=3; print(0 ? f(x) : 2); }
Parsing:
$ node src/bella.js examples/small.bella parsed Syntax is ok
Analysis:
$ node src/bella.js examples/small.bella analyzed 1 | Program statements=[#2,#6,#10] 2 | VariableDeclaration variable=#3 initializer=#4 3 | Variable name='x' readOnly=false 4 | Call callee=#5 args=[9] 5 | Function name='sqrt' paramCount=1 6 | FunctionDeclaration fun=#7 params=[#8] body=#9 7 | Function name='f' paramCount=1 8 | Variable name='x' readOnly=true 9 | BinaryExpression op='*' left=3 right=#8 10 | WhileStatement test=true body=[#11,#12] 11 | Assignment target=#3 source=3 12 | PrintStatement argument=#13 13 | Conditional test=0 consequent=#14 alternate=2 14 | Call callee=#7 args=[#3]
This is the textual representation of the first diagram on this page.
Optimization:
$ node src/bella.js examples/small.bella optimized 1 | Program statements=[#2,#4,#8] 2 | VariableDeclaration variable=#3 initializer=3 3 | Variable name='x' readOnly=false 4 | FunctionDeclaration fun=#5 params=[#6] body=#7 5 | Function name='f' paramCount=1 6 | Variable name='x' readOnly=true 7 | BinaryExpression op='*' left=3 right=#6 8 | WhileStatement test=true body=[#9,#10] 9 | Assignment target=#3 source=3 10 | PrintStatement argument=2
And finally, the generated code:
$ node src/bella.js examples/small.bella js let x_1 = 3; function f_3(x_2) { return (3 * x_2); } while (true) { x_1 = 3; console.log(2); }
It’s on GitHubYou can find the complete compiler here.
We’ve covered: