A Bella Compiler

Bella is a little language that makes a great case study for compiler writing. Let’s write a compiler!

Getting Started

Let’s write a complete compiler for the Bella Programming Language. If you’re not familiar with the language, browse the official definition now.

We’ll write the compiler in JavaScript. These notes assume you are somewhat acquainted with Node.js, NPM, Git, Ohm, and the Bella language. If you aren’t, you should still be able to follow along a bit at least.

Scaffolding

A good way to start is to create a project on GitHub, asking GitHub to create the README, the license, and the .gitignore. Select “Node” for the .gitignore template: GitHub knows what to do.

Clone your project, flesh out the README, run npm init -y, and put "type": "module" in your package.json. Add a .prettierrc.json if you like and a project logo in the docs folder. For reference, here’s the complete project structure we’ll be working towards:

  .
  ├── .gitignore           (made by GitHub)
  ├── LICENSE              (made by GitHub)
  ├── README.md            (made by GitHub)
  ├── .prettierrc.json     (totally optional)
  ├── docs
  │   └── bellalogo.png
  ├── examples
  │   └── hello.bella
  ├── package.json         (made by "npm init")
  ├── src
  │   ├── bella.js
  │   ├── compiler.js
  │   ├── bella.ohm
  │   ├── core.js
  │   ├── parser.js
  │   ├── analyzer.js
  │   ├── optimizer.js
  │   └── generator.js
  └── test
      ├── compiler.test.js
      ├── parser.test.js
      ├── analyzer.test.js
      ├── optimizer.test.js
      └── generator.test.js

Run npm install ohm-js graph-stringify and npm install c8 --save-dev, as we will use c8 for test coverage, Ohm for lexing and parsing, and graph-stringify to visualize our abstract syntax trees and semantic graphs. Replace the "test:" portion in the scripts section of package.json to be "test": "c8 node --test test/**/*.js".

For our project architecture, we’ll have a core module that defines the language entities (including those in the standard library), a compiler module that includes a compile function to run the phases in sequence, and a file for a command line interface. The phases of parsing, static analysis, optimization, and code generation each have their own modules. The parser is driven by the grammar, stored in its own Ohm file:

Need more precise instructions?
Although we’re now just looking at this existing compiler, we will review how to make Nodejs as we go. The steps to get the README and the package.json created were:

Create a new project at GitHub, entering the name and description, checking the Public radio button, checking the “Add a README file” box (important), checking “Add .gitigore” and choosing Node for the template (also important), and checking “Choose a license” and selecting MIT.

Clone the repo.

npm init -y

npm install ohm-js graph-stringify

npm install c8 --save-dev

Edit package.json so that the test line reads "test": "c8 node --test test/**/*.js"
Edit package.json to add a line near the top that says "type": "module". (This does not happen automatically! Without this line you get ugly, buggy, old-fashioned, yucky JavaScript.)

Perform any other package.json cleanups you like (author, keywords, etc.)

Create a .prettierrc.json file (optional)

Make sure the Prettier plugin is installed in the editor you are using. Configure things so that you at least “format on save.” You are morally obligated to use a code formatter. Don’t format on your own. It’s not worth it, and it is not professional nor ethical.

Create your logo and save it in a new docs folder.

Make some edits to your README.file, enough to be interesting. Make sure your logo is visible at the top of this file. (READMEs are best with logos!)

Git add, git commit, git push (you should know how to do this already...if not, ask! Then do a Git tutorial.)

Sure, it looks overwhelming the first, time, but once you do it a few times, it becomes second nature.

We’ll add the rest of the files in the notes below.

The Parser

Here’s the Ohm grammar for Bella grammar, copied from the language definition page:

bella.ohm

Bella {
  Program   = Statement+
  Statement = let id "=" Exp ";"                        -- vardec
            | function id Params "=" Exp ";"            -- fundec
            | Exp7_id "=" Exp ";"                       -- assign
            | print Exp ";"                             -- print
            | while Exp Block                           -- while
  Params    = "(" ListOf<id, ","> ")"
  Block     = "{" Statement* "}"

  Exp       = ("-" | "!") Exp7                          -- unary
            | Exp1 "?" Exp1 ":" Exp                     -- ternary
            | Exp1
  Exp1      = Exp1 "||" Exp2                            -- binary
            | Exp2
  Exp2      = Exp2 "&&" Exp3                            -- binary
            | Exp3
  Exp3      = Exp4 ("<="|"<"|"=="|"!="|">="|">") Exp4   -- binary
            | Exp4
  Exp4      = Exp4 ("+" | "-") Exp5                     -- binary
            | Exp5
  Exp5      = Exp5 ("*" | "/" | "%") Exp6               -- binary
            | Exp6
  Exp6      = Exp7 "**" Exp6                            -- binary
            | Exp7
  Exp7      = num
            | true
            | false
            | id "(" ListOf<Exp, ","> ")"               -- call
            | id                                        -- id
            | "(" Exp ")"                               -- parens

  let       = "let" ~idchar
  function  = "function" ~idchar
  while     = "while" ~idchar
  true      = "true" ~idchar
  false     = "false" ~idchar
  print     = "print" ~idchar
  keyword   = let | function | while | true | false
  num       = digit+ ("." digit+)? (("E" | "e") ("+" | "-")? digit+)?
  id        = ~keyword letter idchar*
  idchar    = letter | digit | "_"
  space    += "//" (~"\n" any)*                         -- comment
}

Let’s write the parser first. It’s fairly trivial, because Ohm does the work for us once we write the grammar. We do have to set things up just a little to read the grammar from a file and to invoke the match method on the grammar to produce a match result object:

parser.js

// The parse() function uses Ohm to produce a match object for a given
// source code program, using the grammar in the file bella.ohm.

import * as fs from "node:fs"
import * as ohm from "ohm-js"

const grammar = ohm.grammar(fs.readFileSync("src/bella.ohm"))

// Returns the Ohm match if successful, otherwise throws an error
export default function parse(sourceCode) {
  const match = grammar.match(sourceCode)
  if (!match.succeeded()) throw new Error(match.message)
  return match
}

We’ll return a match result object if the match was successful, and throw an error with the match’s message property already set by Ohm. This message is very convenient; here’s an example of what might go wrong when asked to parse a program with a syntax error:

Error: Line 2, col 14:
  1 | let x = 5;
> 2 | print x - 2 /;
                   ^
  3 | print 5;
Expected "(", a letter, "false", "true", or a digit

To test the parser, we have to both (1) ensure the parser accepts programs that are syntactically correct, and (2) for syntactically incorrect programs, ensure that an error is thrown and that the error message contains expected text. Here’s a quick test. Perhaps it could be a little better, but it’s a good start, and has complete coverage:

parser.test.js

import { describe, it } from "node:test"
import assert from "node:assert/strict"
import parse from "../src/parser.js"

const syntaxChecks = [
  ["all numeric literal forms", "print(8 * 89.123);"],
  ["complex expressions", "print(83 * ((((-((((13 / 21)))))))) + 1 - 0);"],
  ["all unary operators", "print (-3); print (!false);"],
  ["all binary operators", "print x && y || z * 1 / 2 ** 3 + 4 < 5;"],
  ["all arithmetic operators", "let x = (!3) * 2 + 4 - (-7.3) * 8 ** 13 / 1;"],
  ["all relational operators", "let x = 1<(2<=(3==(4!=(5 >= (6>7)))));"],
  ["all logical operators", "let x = true && false || (!false);"],
  ["the conditional operator", "print x ? y : z;"],
  ["end of program inside comment", "print(0); // yay"],
  ["comments with no text are ok", "print(1);//\nprint(0);//"],
  ["non-Latin letters in identifiers", "コンパイラ = 100;"],
]

const syntaxErrors = [
  ["non-letter in an identifier", "ab😭c = 2", /Line 1, col 3/],
  ["malformed number", "x= 2.", /Line 1, col 6/],
  ["missing semicolon", "x = 3 y = 1", /Line 1, col 7/],
  ["a missing right operand", "print(5 -", /Line 1, col 10/],
  ["a non-operator", "print(7 * ((2 _ 3)", /Line 1, col 15/],
  ["an expression starting with a )", "x = );", /Line 1, col 5/],
  ["a statement starting with expression", "x * 5;", /Line 1, col 3/],
  ["an illegal statement on line 2", "print(5);\nx * 5;", /Line 2, col 3/],
  ["a statement starting with a )", "print(5);\n) * 5", /Line 2, col 1/],
  ["an expression starting with a *", "x = * 71;", /Line 1, col 5/],
]

describe("The parser", () => {
  for (const [scenario, source] of syntaxChecks) {
    it(`properly specifies ${scenario}`, () => {
      assert(parse(source).succeeded())
    })
  }
  for (const [scenario, source, errorMessagePattern] of syntaxErrors) {
    it(`does not permit ${scenario}`, () => {
      assert.throws(() => parse(source), errorMessagePattern)
    })
  }
})

As will all JavaScript projects built with Node, you should test frequently. We’ve set things up so that all you have to do is enter:

$ npm test

to do so.

Compiler Core

A focus of compiler writing is understanding the language in terms of its major semantic abstractions, called entities. When we analyze a program, we construct a representation of it using a graph of entities. For example, this Bella program:

let x = sqrt(9);
function f(x) = 3 * x;
while (true) {
  x = 3;
  print 0 ? f(x) : 2;
}

is represented with this entity graph:

By reading the Bella language description, we would come up with the set of needed entities:

Class	Properties	Notes
Program	statements: Statement[]	A program is a sequence of statements.
VariableDeclaration	variable: Variable initializer: Expression	A variable declaration defines a new variable and supplies its initial value.
Variable	name: String mutable: Boolean	A variable is one of two entity kinds in Bella (the other is the function). Variables have a name and can be mutable or immutable. All variables in Bella have a numeric type. Bella is dynamically-typed anyway, so no type constraint is needed.
FunctionDeclaration	fun: Function params: Variable[] body: Statement[]	A function declaration defines a new function entity and supplies it with a body. The parameters can be stored with the declaration, rather than with the function directly, since Bella is simple enough to not need anything more than a parameter count to be stored with the function.
Function	name: String parameterCount: Int	A function is one of two entity kinds in Bella (the other is the variable). Functions have a name. We also keep a parameter count with them because when called, the number of call arguments must equal the number of function parameters. In Bella, we don’t have to store parameter names or types; the count alone suffices.
Assignment	target: Variable source: Expression	Assignment statements assign a source expression to a target variable. Nothing more, nothing less.
WhileStatement	test: Expression body: Statement[]	A while statement has a test expression and a sequence of statements for a body.
PrintStatement	argument: Expression	The Bella print statement prints a single expression.
Call	callee: Function args: Expression[]	A function is called with zero or more expression arguments.
Conditional	test: Expression consequent: Expression alternate: Expression	The conditional expression `x?y:z` features three constituent expressions.
BinaryExpression	op: String left: Expression right: Expression	All expressions with a binary operator are represented by an object of this class.
UnaryExpression	op: String operand: Expression	An expression with a unary operator and a single operand.

A language core should also define objects for the actual variables and functions that are predefined (part of a “standard library”). To keep the compiler lightweight, we’ll use raw JavaScript objects to represent the program representation nodes. Each node will have a distinguished field called kind that serves as the, well, kind of node that it is. (If you were using TypeScript, you could get much more rigorous and define classes for each of the node types.)

core.js

export function program(statements) {
  return { kind: "Program", statements }
}

export function variableDeclaration(variable, initializer) {
  return { kind: "VariableDeclaration", variable, initializer }
}

export function variable(name, mutable = false) {
  return { kind: "Variable", name, mutable }
}

export function functionDeclaration(fun) {
  return { kind: "FunctionDeclaration", fun }
}

export function fun(name, params, body) {
  return { kind: "Function", name, params, body }
}

export function intrinsicFunction(name, params) {
  return { kind: "Function", name, params, intrinsic: true }
}

export function assignment(target, source) {
  return { kind: "Assignment", target, source }
}

export function whileStatement(test, body) {
  return { kind: "WhileStatement", test, body }
}

export function printStatement(argument) {
  return { kind: "PrintStatement", argument }
}

export function call(callee, args) {
  if (callee.intrinsic) {
    // Intrinsics get special operators
    if (callee.params.length === 1) {
      return unary(callee.name, args[0])
    } else {
      // Bella only has unary and binary intrinsic functions, no more
      return binary(callee.name, args[0], args[1])
    }
  }
  // Nonintrinsics get a regular Call node
  return { kind: "Call", callee, args }
}

export function conditional(test, consequent, alternate) {
  return { kind: "Conditional", test, consequent, alternate }
}

export function binary(op, left, right) {
  return { kind: "BinaryExpression", op, left, right }
}

export function unary(op, operand) {
  return { kind: "UnaryExpression", op, operand }
}

export const standardLibrary = Object.freeze({
  π: variable("π", false),
  sqrt: intrinsicFunction("sqrt", [variable("x")]),
  sin: intrinsicFunction("sin", [variable("x")]),
  cos: intrinsicFunction("cos", [variable("x")]),
  exp: intrinsicFunction("exp", [variable("x")]),
  ln: intrinsicFunction("ln", [variable("x")]),
  hypot: intrinsicFunction("hypot", [variable("x"), variable("y")]),
})

Exercise: Discuss the difference between VariableDeclaration and FunctionDeclaration on the one hand and Variable and Function on the other. Why do we need all four?

Static Analysis

Ohm’s match result is a representation of the concrete syntax tree. We’ll want to turn that into an abstract syntax tree using the core objects we’ve just seen. But while converting the CST into an AST, we’ll need to also perform a number of static contextual checks, that cannot be captured in the grammar, such as:

An identifier cannot be used unless it is one of the built-in identifiers or has been previously declared in the current scope or any ancestor of the current scope.
All function calls must accept the proper number of arguments.
The built-in identifiers cannot be assigned to (they are immutable).
Identifiers declared as functions can only be called, not used in any other context.
Identifiers declared as numbers cannot be called.

These checks are made in the static analysis (also known as the semantic analysis) phase of the compiler front-end, so its nicely handled by context objects. Context objects keep track of the local variables that have been defined in the current scope. Because Bella’s scopes are nested, each context object will hold a reference to its parent context.

The initial scope contains the standard library entities, and new scopes are created for functions.

Earlier, we saw how for failed matches, Ohm’s match result object contained a very convenient message property showing contextual information for the error. For errors discovered in our analyzer, we have to explicitly invoke the getLineAndColumnMessage method on parse tree nodes. One way to do this is to make a special validation function through which all errors pass. Whenever our analyzer detects an error, it passes the parse tree node as a parameter to the validation function, and the validator packages up a nice error message and throws.

Error: Line 2, col 1:
  1 | print(exp(π));
> 2 | π = 3;
      ^
  3 | 
The identifier π is read only

Here’s the analyzer:

analyzer.js

// The semantic analyzer exports a single function, analyze(match), that
// accepts a grammar match object (the CST) from Ohm and produces the
// internal representation of the program (pretty close to what is usually
// called the AST). This representation also includes entities from the
// standard library, as needed.

import * as core from "./core.js"

class Context {
  constructor({ parent, locals = {} }) {
    this.parent = parent
    this.locals = new Map(Object.entries(locals))
  }
  add(name, entity) {
    this.locals.set(name, entity)
  }
  lookup(name) {
    return this.locals.get(name) || this.parent?.lookup(name)
  }
}

export default function analyze(match) {
  // Track the context manually via a simple variable. The initial context
  // contains the mappings from the standard library. Add to this context
  // as necessary. When needing to descent into a new scope, create a new
  // context with the current context as its parent. When leaving a scope,
  // reset this variable to the parent context.
  let context = new Context({ locals: core.standardLibrary })

  // The single gate for error checking. Pass in a condition that must be true.
  // Use errorLocation to give contextual information about the error that will
  // appear: this should be an object whose "at" property is a parse tree node.
  // Ohm's getLineAndColumnMessage will be used to prefix the error message.
  function must(condition, message, errorLocation) {
    if (!condition) {
      const prefix = errorLocation.at.source.getLineAndColumnMessage()
      throw new Error(`${prefix}${message}`)
    }
  }

  function mustNotAlreadyBeDeclared(name, at) {
    must(!context.locals.has(name), `Identifier ${name} already declared`, at)
  }

  function mustHaveBeenFound(entity, name, at) {
    must(entity, `Identifier ${name} not declared`, at)
  }

  function mustBeAVariable(entity, at) {
    // Bella has two kinds of entities: variables and functions.
    must(entity?.kind === "Variable", `Functions can not appear here`, at)
  }

  function mustBeAFunction(entity, at) {
    must(entity?.kind === "Function", `${entity.name} is not a function`, at)
  }

  function mustBeMutable(entity, at) {
    must(entity.mutable, `${entity.name} is read only`, at)
  }

  function mustHaveCorrectArgumentCount(args, callee, at) {
    const argCount = args.length
    const paramCount = callee.params.length
    const equalCount = argCount === paramCount
    must(equalCount, `${paramCount} argument(s) required but ${argCount} passed`, at)
  }

  const builder = match.matcher.grammar.createSemantics().addOperation("rep", {
    Program(statements) {
      return core.program(statements.children.map(s => s.rep()))
    },

    Statement_vardec(_let, id, _eq, exp, _semicolon) {
      // Analyze the initializer *before* adding the variable to the context,
      // because we don't want the variable to come into scope until after
      // the declaration. That is, "let x=x;" should be an error (unless x
      // was already defined in an outer scope.)
      mustNotAlreadyBeDeclared(id.sourceString, { at: id })
      const initializer = exp.rep()
      const variable = core.variable(id.sourceString, true)
      context.add(id.sourceString, variable)
      return core.variableDeclaration(variable, initializer)
    },

    Statement_fundec(_fun, id, parameters, _equals, exp, _semicolon) {
      // Add new function object to the context right away to allow recursion.
      mustNotAlreadyBeDeclared(id.sourceString, { at: id })
      const fun = core.fun(id.sourceString)
      context.add(id.sourceString, fun)

      // The params and body must be analyzed in a new context.
      context = new Context({ parent: context })
      fun.params = parameters.rep()
      fun.body = exp.rep()
      context = context.parent

      return core.functionDeclaration(fun)
    },

    Params(_open, idList, _close) {
      return idList.asIteration().children.map(id => {
        const param = core.variable(id.sourceString, false)
        // All of the parameters have to be unique
        mustNotAlreadyBeDeclared(id.sourceString, { at: id })
        context.add(id.sourceString, param)
        return param
      })
    },

    Statement_assign(id, _eq, exp, _semicolon) {
      const target = id.rep()
      mustBeMutable(target, { at: id })
      return core.assignment(target, exp.rep())
    },

    Statement_print(_print, exp, _semicolon) {
      return core.printStatement(exp.rep())
    },

    Statement_while(_while, exp, block) {
      return core.whileStatement(exp.rep(), block.rep())
    },

    Block(_open, statements, _close) {
      return statements.children.map(s => s.rep())
    },

    Exp_unary(op, exp) {
      return core.unary(op.sourceString, exp.rep())
    },

    Exp_ternary(exp1, _questionMark, exp2, _colon, exp3) {
      return core.conditional(exp1.rep(), exp2.rep(), exp3.rep())
    },

    Exp1_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp2_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp3_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp4_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp5_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp6_binary(exp1, op, exp2) {
      return core.binary(op.sourceString, exp1.rep(), exp2.rep())
    },

    Exp7_parens(_open, exp, _close) {
      return exp.rep()
    },

    Exp7_call(id, open, expList, _close) {
      // ids used in calls must have already been declared and must be
      // bound to function entities, not to variable entities.
      const callee = context.lookup(id.sourceString)
      mustHaveBeenFound(callee, id.sourceString, { at: id })
      mustBeAFunction(callee, { at: id })
      const args = expList.asIteration().children.map(arg => arg.rep())
      mustHaveCorrectArgumentCount(args, callee, { at: open })
      return core.call(callee, args)
    },

    Exp7_id(id) {
      // ids used in expressions must have been already declared and must
      // be bound to variable entities, not function entities.
      const entity = context.lookup(id.sourceString)
      mustHaveBeenFound(entity, id.sourceString, { at: id })
      mustBeAVariable(entity, { at: id })
      return entity
    },

    true(_) {
      return true
    },

    false(_) {
      return false
    },

    num(_whole, _point, _fraction, _e, _sign, _exponent) {
      return Number(this.sourceString)
    },
  })

  return builder(match).rep()
}

And here’s a test:

analyzer.test.js

import { describe, it } from "node:test"
import assert from "node:assert/strict"
import parse from "../src/parser.js"
import analyze from "../src/analyzer.js"
import * as core from "../src/core.js"

const semanticChecks = [
  ["variables can be printed", "let x = 1; print x;"],
  ["variables can be reassigned", "let x = 1; x = x * 5 / ((-3) + x);"],
  ["all predefined identifiers", "print ln(sqrt(sin(cos(hypot(π,1) + exp(5.5E2)))));"],
]

const semanticErrors = [
  ["using undeclared identifiers", "print(x);", /Identifier x not declared/],
  ["a variable used as function", "x = 1; x(2);", /Expected "="/],
  ["a function used as variable", "print(sin + 1);", /Functions can not appear here/],
  ["re-declared identifier", "let x = 1; let x = 2;", /Identifier x already declared/],
  ["an attempt to write a read-only var", "π = 3;", /π is read only/],
  ["too few arguments", "print(sin());", /1 argument\(s\) required but 0 passed/],
  ["too many arguments", "print(sin(5, 10));", /1 argument\(s\) required but 2 passed/],
]

const sample = "let x=sqrt(9);function f(x)=3*x;while(true){x=3;print(0?f(x):2);}"

describe("The analyzer", () => {
  for (const [scenario, source] of semanticChecks) {
    it(`recognizes ${scenario}`, () => {
      assert.ok(analyze(parse(source)))
    })
  }
  for (const [scenario, source, errorMessagePattern] of semanticErrors) {
    it(`throws on ${scenario}`, () => {
      assert.throws(() => analyze(parse(source)), errorMessagePattern)
    })
  }
  it(`produces the expected graph for the simple sample program`, () => {
    const program = analyze(parse(sample))
    let x = core.variable("x", true)
    let localX = core.variable("x", false)
    let f = core.fun("f", [localX], core.binary("*", 3, localX))
    assert.deepEqual(
      program,
      core.program([
        core.variableDeclaration(x, core.call(core.standardLibrary.sqrt, [9])),
        core.functionDeclaration(f),
        core.whileStatement(true, [
          core.assignment(x, 3),
          core.printStatement(core.conditional(0, core.call(f, [x]), 2)),
        ]),
      ])
    )
  })
})

We were careful to check both statically valid and invalid code snippets, as well as checking that the analyzer could produce an expected semantic graph for a non-trivial program.

Optimization

After analysis, we still have a faithful representation of the source language syntax, we have all the useful line and column numbers, and have representations of every function and every variable, whether used or not.

The next phase collapses things quite a bit. It gets rid of unreachable and dead code, and performs a bunch of algebraic simplifications where it can. For our Bella compiler, we have the following opportunities:

Constant folding (e.g., replace 3 + 5 with 8)
Algebraic simplifications (e.g., adding zero, multiplying by 1, etc.)
Remove assignments to self (e.g., x = x;)

Run-time error handling?

Sometimes, compilers will keep the information about source code with line and column numbers around and actually bundle this into the translated code, so that any errors that occur at run time can be reported in great detail.

However, in this Bella compiler, we won’t do this. Bella is so static, that all error reporting has been done in the analyzer. We don’t ever have run-time errors.

Here’s the optimized version of our running example:

Because of the way the optimizer collapses the semantic graph, each optimization method returns a graph fragment. Here’s the module:

optimizer.js

// The optimizer module exports a single function, optimize(node), to perform
// machine-independent optimizations on the analyzed semantic representation.
//
// The only optimizations supported here are:
//
//   - assignments to self (x = x) turn into no-ops
//   - constant folding
//   - some strength reductions (+0, -0, *0, *1, etc.)
//   - Conditionals with constant tests collapse into a single arm

import { unary } from "./core.js"

export default function optimize(node) {
  return optimizers?.[node.kind]?.(node) ?? node
}

const optimizers = {
  Program(p) {
    p.statements = p.statements.flatMap(optimize)
    return p
  },
  VariableDeclaration(d) {
    d.initializer = optimize(d.initializer)
    return d
  },
  FunctionDeclaration(d) {
    d.fun = optimize(d.fun)
    return d
  },
  Function(f) {
    if (f.body) f.body = optimize(f.body)
    return f
  },
  Assignment(s) {
    s.source = optimize(s.source)
    if (s.source === s.target) {
      return []
    }
    return s
  },
  WhileStatement(s) {
    s.test = optimize(s.test)
    s.body = s.body.flatMap(optimize)
    return s
  },
  PrintStatement(s) {
    s.argument = optimize(s.argument)
    return s
  },
  Call(c) {
    c.args = c.args.map(optimize)
    return c
  },
  Conditional(c) {
    c.test = optimize(c.test)
    c.consequent = optimize(c.consequent)
    c.alternate = optimize(c.alternate)
    if (c.test.constructor === Number || c.test.constructor === Boolean) {
      return c.test ? c.consequent : c.alternate
    }
    return c
  },
  BinaryExpression(e) {
    e.left = optimize(e.left)
    e.right = optimize(e.right)
    if (e.left.constructor === Number) {
      if (e.right.constructor === Number) {
        if (e.op === "+") {
          return e.left + e.right
        } else if (e.op === "-") {
          return e.left - e.right
        } else if (e.op === "*") {
          return e.left * e.right
        } else if (e.op === "/") {
          return e.left / e.right
        } else if (e.op === "%") {
          return e.left % e.right
        } else if (e.op === "**" && !(e.left === 0 && e.right == 0)) {
          return e.left ** e.right
        } else if (e.op === "hypot") {
          return Math.hypot(e.left, e.right)
        }
      } else if (e.left === 0 && e.op === "+") {
        return e.right
      } else if (e.left === 1 && e.op === "*") {
        return e.right
      } else if (e.left === 0 && e.op === "-") {
        return unary("-", e.right)
      } else if (e.left === 0 && ["*", "/", "%"].includes(e.op)) {
        return 0
      } else if (e.op === "**" && e.left === 1) {
        return 1
      }
    } else if (e.right.constructor === Number) {
      if (["+", "-"].includes(e.op) && e.right === 0) {
        return e.left
      } else if (["*", "/"].includes(e.op) && e.right === 1) {
        return e.left
      } else if (e.op === "*" && e.right === 0) {
        return 0
      } else if (e.op === "**" && e.left !== 0 && e.right === 0) {
        return 1
      }
    }
    return e
  },
  UnaryExpression(e) {
    e.operand = optimize(e.operand)
    if (e.operand.constructor === Number) {
      if (e.op === "-") return -e.operand
      if (e.op === "sqrt") return Math.sqrt(e.operand)
      if (e.op === "sin") return Math.sin(e.operand)
      if (e.op === "cos") return Math.cos(e.operand)
      if (e.op === "ln") return Math.log(e.operand)
      if (e.op === "exp") return Math.exp(e.operand)
    }
    return e
  },
}

And the test:

optimizer.test.js

import { describe, it } from "node:test"
import assert from "node:assert/strict"
import parse from "../src/parser.js"
import analyze from "../src/analyzer.js"
import optimize from "../src/optimizer.js"
import * as core from "../src/core.js"

// Make some test cases easier to read
const x = core.variable("x", true)
const neg = x => core.unary("-", x)
const power = (x, y) => core.binary("**", x, y)
const cond = (x, y, z) => core.conditional(x, y, z)
const sqrt = core.standardLibrary.sqrt
const call = (f, args) => core.call(f, args)
const letXEq1 = core.variableDeclaration(x, 1)
const print = e => core.printStatement(e)
const parameterless = (name, body) => core.fun(name, [], body)
const program = p => analyze(parse(p))
const expression = e => program(`let x=1; print ${e};`).statements[1].argument

const tests = [
  ["folds +", expression("5 + 8"), 13],
  ["folds -", expression("5 - 8"), -3],
  ["folds *", expression("5 * 8"), 40],
  ["folds /", expression("5 / 8"), 0.625],
  ["folds %", expression("17 % 5"), 2],
  ["folds **", expression("5 ** 8"), 390625],
  ["optimizes +0", expression("x + 0"), x],
  ["optimizes -0", expression("x - 0"), x],
  ["optimizes *1", expression("x * 1"), x],
  ["optimizes /1", expression("x / 1"), x],
  ["optimizes *0", expression("x * 0"), 0],
  ["optimizes 0*", expression("0 * x"), 0],
  ["optimizes 0/", expression("0 / x"), 0],
  ["optimizes 0+", expression("0 + x"), x],
  ["optimizes 0-", expression("0 - x"), neg(x)],
  ["optimizes 1*", expression("1 * x"), x],
  ["folds negation", expression("- 8"), -8],
  ["optimizes 1**", expression("1 ** x"), 1],
  ["optimizes **0", expression("x ** 0"), 1],
  ["optimizes sqrt", expression("sqrt(16)"), 4],
  ["optimizes sin", expression("sin(0)"), 0],
  ["optimizes cos", expression("cos(0)"), 1],
  ["optimizes exp", expression("exp(1)"), Math.E],
  ["optimizes ln", expression("ln(2)"), Math.LN2],
  ["optimizes hypot", expression("hypot(9, 12)"), 15],
  ["optimizes deeply", expression("8 * (-5) + 2 ** 3"), -32],
  ["optimizes arguments", expression("sqrt(20 + 61)"), 9],
  ["optimizes true conditionals", expression("1?3:5"), 3],
  ["optimizes false conditionals", expression("0?3:5"), 5],
  ["leaves nonoptimizable binaries alone", expression("x ** 5"), power(x, 5)],
  ["leaves 0**0 alone", expression("0 ** 0"), power(0, 0)],
  ["leaves nonoptimizable conditionals alone", expression("x?x:2"), cond(x, x, 2)],
  ["leaves nonoptimizable calls alone", expression("sqrt(x)"), call(sqrt, [x])],
  ["leaves nonoptimizable negations alone", expression("-x"), neg(x)],
  [
    "optimizes in function body",
    program("function f() = 1+1;"),
    core.program([core.functionDeclaration(parameterless("f", 2))]),
  ],
  ["removes x=x", program("let x=1; x=x; print(x);"), core.program([letXEq1, print(x)])],
  [
    "optimizes while test",
    program("while sqrt(25) {}"),
    core.program([core.whileStatement(5, [])]),
  ],
]

describe("The optimizer", () => {
  for (const [scenario, before, after] of tests) {
    it(`${scenario}`, () => {
      assert.deepEqual(optimize(before), after)
    })
  }
})

Code Generation

Our Bella compiler is really a transpiler. We just output JavaScript. Our generator is pretty simple; it walks through the optimized graph emitting JavaScript on the fly. Not too much is interesting, other than perhaps the “decorating” of Bella names by adding numeric suffixes.

generator.js

// The code generator exports a single function, generate(program), which
// accepts a program representation and returns the JavaScript translation
// as a string.

import { standardLibrary } from "./core.js"

export default function generate(program) {
  // When generating code for statements, we'll accumulate the lines of
  // the target code here. When we finish generating, we'll join the lines
  // with newlines and return the result.
  const output = []

  // Variable names in JS will be suffixed with _1, _2, _3, etc. This is
  // because "for", for example, is a legal variable name in Bella, but
  // not in JS. So we want to generate something like "for_1". We handle
  // this by mapping each variable declaration to its suffix.
  const targetName = (mapping => {
    return entity => {
      if (!mapping.has(entity)) {
        mapping.set(entity, mapping.size + 1)
      }
      return `${entity.name}_${mapping.get(entity)}`
    }
  })(new Map())

  const gen = node => generators?.[node?.kind]?.(node) ?? node

  const generators = {
    // Key idea: when generating an expression, just return the JS string; when
    // generating a statement, write lines of translated JS to the output array.
    Program(p) {
      p.statements.forEach(gen)
    },
    VariableDeclaration(d) {
      output.push(`let ${targetName(d.variable)} = ${gen(d.initializer)};`)
    },
    Variable(v) {
      if (v === standardLibrary.π) return "Math.PI"
      return targetName(v)
    },
    FunctionDeclaration(d) {
      const params = d.fun.params.map(targetName).join(", ")
      output.push(`function ${targetName(d.fun)}(${params}) {`)
      output.push(`return ${gen(d.fun.body)};`)
      output.push("}")
    },
    Function(f) {
      return targetName(f)
    },
    PrintStatement(s) {
      output.push(`console.log(${gen(s.argument)});`)
    },
    Assignment(s) {
      output.push(`${targetName(s.target)} = ${gen(s.source)};`)
    },
    WhileStatement(s) {
      output.push(`while (${gen(s.test)}) {`)
      s.body.forEach(gen)
      output.push("}")
    },
    Call(c) {
      return `${gen(c.callee)}(${c.args.map(gen).join(",")})`
    },
    Conditional(e) {
      return `((${gen(e.test)}) ? (${gen(e.consequent)}) : (${gen(e.alternate)}))`
    },
    BinaryExpression(e) {
      if (e.op === "hypot") return `Math.hypot(${gen(e.left)},${gen(e.right)})`
      return `(${gen(e.left)} ${e.op} ${gen(e.right)})`
    },
    UnaryExpression(e) {
      const operand = gen(e.operand)
      if (e.op === "sqrt") return `Math.sqrt(${operand})`
      if (e.op === "sin") return `Math.sin(${operand})`
      if (e.op === "cos") return `Math.cos(${operand})`
      if (e.op === "exp") return `Math.exp(${operand})`
      if (e.op === "ln") return `Math.log(${operand})`
      return `${e.op}(${operand})`
    },
  }

  gen(program)
  return output.join("\n")
}

And the test:

generator.test.js

import { describe, it } from "node:test"
import assert from "node:assert/strict"
import parse from "../src/parser.js"
import analyze from "../src/analyzer.js"
import optimize from "../src/optimizer.js"
import generate from "../src/generator.js"

function dedent(s) {
  return `${s}`.replace(/(?<=\n)\s+/g, "").trim()
}

const fixtures = [
  {
    name: "small",
    source: `
      let x = 3 * 7;
      let y = true;
      y = 5 ** (-x) / 100 > (- x) || false;
      print((y && y) || (false || (x*2) != 5));
    `,
    expected: dedent`
      let x_1 = 21;
      let y_2 = true;
      y_2 = ((((5 ** -(x_1)) / 100) > -(x_1)) || false);
      console.log(((y_2 && y_2) || (false || ((x_1 * 2) != 5))));
    `,
  },
  {
    name: "while",
    source: `
      let x = 0;
      while x < 5 {
        let y = 0;
        while y < 5 {
          print x * sqrt(y);
          y = y + 1;
        }
        x = x + 1;
      }
    `,
    expected: dedent`
      let x_1 = 0;
      while ((x_1 < 5)) {
      let y_2 = 0;
      while ((y_2 < 5)) {
      console.log((x_1 * Math.sqrt(y_2)));
      y_2 = (y_2 + 1);
      }
      x_1 = (x_1 + 1);
      }
    `,
  },
  {
    name: "functions",
    source: `
      function gcd(x, y) = y == 0 ? x : gcd(y, x % y);
      print gcd(22, 33);
    `,
    expected: dedent`
      function gcd_3(x_1, y_2) {
        return (((y_2 == 0)) ? (x_1) : (gcd_3(y_2,(x_1 % y_2))));
      }
      console.log(gcd_3(22,33));
    `,
  },
  {
    name: "standard library",
    source: `
      let x = π;
      print(sin(x) - cos(x) + exp(x) * ln(x) / hypot(2.3, x));
    `,
    expected: dedent`
      let x_1 = Math.PI;
      console.log(((Math.sin(x_1) - Math.cos(x_1)) + ((Math.exp(x_1) * Math.log(x_1)) / Math.hypot(2.3,x_1))));
    `,
  },
]

describe("The code generator", () => {
  for (const fixture of fixtures) {
    it(`produces expected js output for the ${fixture.name} program`, () => {
      const actual = generate(optimize(analyze(parse(fixture.source))))
      assert.deepEqual(actual, fixture.expected)
    })
  }
})

Full Compiler

Here is a module with a function to chain the phases together. It accepts a second parameter to allow you to stop after any phase, in case you want to inspect what the compiler does up to a given point.

compiler.js

import parse from "./parser.js"
import analyze from "./analyzer.js"
import optimize from "./optimizer.js"
import generate from "./generator.js"

export default function compile(source, outputType) {
  if (!["parsed", "analyzed", "optimized", "js"].includes(outputType)) {
    throw new Error("Unknown output type")
  }
  const match = parse(source)
  if (outputType === "parsed") return "Syntax is ok"
  const analyzed = analyze(match)
  if (outputType === "analyzed") return analyzed
  const optimized = optimize(analyzed)
  if (outputType === "optimized") return optimized
  return generate(optimized)
}

There’s a light test for this module to make sure the output types are handled:

compiler.test.js

import { describe, it } from "node:test"
import assert from "node:assert/strict"
import compile from "../src/compiler.js"

const sampleProgram = "print(0);"

describe("The compiler", () => {
  it("throws when the output type is missing", () => {
    assert.throws(() => compile(sampleProgram), /Unknown output type/)
  })
  it("throws when the output type is unknown", () => {
    assert.throws(() => compile(sampleProgram, "no such type"), /Unknown output type/)
  })
  it("accepts the parsed option", () => {
    const compiled = compile(sampleProgram, "parsed")
    assert(compiled.startsWith("Syntax is ok"))
  })
  it("accepts the analyzed option", () => {
    const compiled = compile(sampleProgram, "analyzed")
    assert(compiled.kind === "Program")
  })
  it("accepts the optimized option", () => {
    const compiled = compile(sampleProgram, "optimized")
    assert(compiled.kind === "Program")
  })
  it("generates js code when given the js option", () => {
    const compiled = compile(sampleProgram, "js")
    assert(compiled.startsWith("console.log(0)"))
  })
})

Command Line Script

Here’s a command line script that lets you compile from a file:

bella.js

#! /usr/bin/env node

import * as fs from "node:fs/promises"
import stringify from "graph-stringify"
import compile from "./compiler.js"

const help = `Bella compiler

Syntax: bella <filename> <outputType>

Prints to stdout according to <outputType>, which must be one of:

  parsed     a message that the program was matched ok by the grammar
  analyzed   the statically analyzed representation
  optimized  the optimized semantically analyzed representation
  js         the translation to JavaScript
`

async function compileFromFile(filename, outputType) {
  try {
    const buffer = await fs.readFile(filename)
    const compiled = compile(buffer.toString(), outputType)
    console.log(stringify(compiled, "kind") || compiled)
  } catch (e) {
    console.error(`\u001b[31m${e}\u001b[39m`)
    process.exitCode = 1
  }
}

if (process.argv.length !== 4) {
  console.log(help)
} else {
  compileFromFile(process.argv[2], process.argv[3])
}

The graph-stringify module gives us a compact look at the semantic graph, which is convenient for educational, or perhaps just debugging purposes. Let’s see it in action with some example runs. Here’s a sample program:

$ cat examples/small.bella
let x=sqrt(9);
function f(x) = 3 * x;
while(true) {
  x=3;
  print(0 ? f(x) : 2);
}

Parsing:

$ node src/bella.js examples/small.bella parsed
Syntax is ok

Analysis:

$ node src/bella.js examples/small.bella analyzed
   1 | Program statements=[#2,#6,#10]
   2 | VariableDeclaration variable=#3 initializer=#4
   3 | Variable name='x' mutable=true
   4 | Call callee=#5 args=[9]
   5 | Function name='sqrt' paramCount=1
   6 | FunctionDeclaration fun=#7 params=[#8] body=#9
   7 | Function name='f' paramCount=1
   8 | Variable name='x' mutable=false
   9 | BinaryExpression op='*' left=3 right=#8
  10 | WhileStatement test=true body=[#11,#12]
  11 | Assignment target=#3 source=3
  12 | PrintStatement argument=#13
  13 | Conditional test=0 consequent=#14 alternate=2
  14 | Call callee=#7 args=[#3]

This is the textual representation of the first diagram on this page.

Exercise: Spent time studying the connection between the graph-stringify output above and the picture we saw earlier, until you get a good feel for the output.

Exercise: (Challenge) Build something that can render entity graphs as pictures. Perhaps make them interactive as well.

Optimization:

$ node src/bella.js examples/small.bella optimized
   1 | Program statements=[#2,#4,#8]
   2 | VariableDeclaration variable=#3 initializer=3
   3 | Variable name='x' mutable=true
   4 | FunctionDeclaration fun=#5 params=[#6] body=#7
   5 | Function name='f' paramCount=1
   6 | Variable name='x' mutable=false
   7 | BinaryExpression op='*' left=3 right=#6
   8 | WhileStatement test=true body=[#9,#10]
   9 | Assignment target=#3 source=3
  10 | PrintStatement argument=2

Exercise: Check this against the optimized graph pictured earlier in these notes.

And finally, the generated code:

$ node src/bella.js examples/small.bella js
let x_1 = 3;
function f_3(x_2) {
return (3 * x_2);
}
while (true) {
x_1 = 3;
console.log(2);
}

It’s on GitHub
You can find the complete compiler here.

Summary

We’ve covered:

How to set up a compiler project with Node, NPM, Ohm, and c8
A Bella Compiler