Syntax Design

One of the most recognizable features of a language is its syntax. What are some of the things about syntax that matter? What questions might you ask if you were creating a syntax for your own language?

Motivation

A programming language gives us a way to structure our thoughts. Each program has a kind of internal structure, for example:

How can we capture this structure? One way is directly, via pictures. The elements in such a language can be high-level or low-level:

[Images: Snap!, Blueprints]

But text is more dense than pictures, and more directly machine readable at times (no computer vision machine learning necessary). Even if your language is expressed with pictures, there will be an underlying byte-array, or string, representation for storage and transmission. So our focus from now on will be on text.

Exercise: But don’t miss Bret Victor’s The Future of Programming for thoughts on programming visually.

Bracketing

We really do want our languages to be close to the abstract syntax tree that represents them. We didn't always do this: in the early days programs were lists of statements with jumps, otherwise known as gotos. But how do we capture trees as text?

There are two main approaches. The first is to just give a direct representation of the syntax tree. The other is to be creative with syntax, making something that a human might possibly like to read.

Let’s play around with the above tree and see what we come up with.

S-Expressions

The usual direct representation of trees in strings is that of S-expressions—parenthesized forms such as (A B C D) where A is the root and B, C, and D are A’s (ordered) children. The children can be primitive values or even...wait for it...trees.

(function f (params x y)
    (block
        (var b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Beautiful, right? Well-loved languages like LISP, Scheme, Racket, and Clojure use S-expressions (though with a few little extensions here and there so as not to drive programmers totally insane).
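Part of the appeal is that such text is trivial to turn back into a tree. The following is a minimal Python sketch, not how any real Lisp reader works: the names tokenize and parse are made up for illustration, only integers and bare names are treated as atoms, and there is no error handling for unbalanced parentheses.

```python
import re

def tokenize(text):
    # Parentheses are tokens on their own; everything else is split on whitespace.
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens):
    """Consume tokens from the front and return one tree (nested lists) or one atom."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return node
    # Atoms: integers become ints, everything else stays a string.
    return int(token) if token.lstrip("-").isdigit() else token

tree = parse(tokenize("(return (/ a 5))"))
print(tree)  # ['return', ['/', 'a', 5]]
```

The nested lists that come out are the syntax tree itself, which is why this style counts as a direct representation.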

Notice that in the S-expression for the function f above, some of the objects aren’t actually evaluated. We’re seeing the distinction between symbols and value expressions. Can we make this explicit in the syntax? Yes we can:

(function :f (params :x :y)
    (block
        (var :b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for :i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Here :x means just “the symbol x.” It’s just a symbol, nothing more, nothing less. On the other hand, x is an expression that means “go look up the value associated with the symbol :x and use this value.”

That is a pretty huge difference, isn’t it?
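To make the difference concrete, here is a small Python sketch of an evaluator for atoms only. It assumes, purely for illustration, that symbols are written as strings with a leading colon and that the environment is a plain dictionary; evaluate is a hypothetical name, not part of any real language implementation.

```python
def evaluate(expr, env):
    """Evaluate an atom: ':x' is a symbol, 'x' is a variable reference."""
    if isinstance(expr, str) and expr.startswith(":"):
        return expr          # a symbol evaluates to itself
    if isinstance(expr, str):
        return env[expr]     # a bare name is looked up in the environment
    return expr              # numbers and other literals are already values

env = {"x": 10}
print(evaluate(":x", env))  # :x
print(evaluate("x", env))   # 10
```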

Exercise: Explain this difference to your neighbor.

XML Style

The tree can be given directly in XML, too. This is ridiculously verbose, not only because we have some markup characters, but because we have to invent elements for leaf nodes. Why are we looking at this? Not to ruin your day, but rather, to show, viscerally, that it’s possible to show hierarchical structures as strings in more than one way.

Please do not do anything like this.

<function name="f" params="x,y">
  <block>
    <var id="b"><intlit value="3"/></var>
    <if>
      <less>
        <ref id="x"/>
        <times>
          <ref id="a"/>
          <plus><ref id="b"/><intlit value="7"/></plus>
        </times>
      </less>
      <while>
        <ref id="found"/>
        <block>
          <for var="i">
            <range><intlit value="2"/><intlit value="5"/></range>
            <print><ref id="i"/></print>
          </for>
          <if>
            <eq><ref id="a"/><intlit value="2"/></eq>
            <break/>
          </if>
        </block>
      </while>
      <assign>
        <ref id="p"/>
        <cos><intlit value="5"/></cos>
      </assign>
    </if>
    <return><divide><ref id="a"/><intlit value="5"/></divide></return>
  </block>
</function>

A slight simplification is to make something like this:

Function { 
    name: "f", 
    params: ["x", "y"], 
    body: Block { statements: [
        Var { id: "b", init: 3 },
        If {
            condition: Less { 
                left: "x", 
                right: Multiply { 
                    left: "a", right: Plus { left: "b", right: 7 } } },
            truePart: While {
                condition: "found", 
                body: Block { statements: [
                    For {
                        var: "i",
                        range: Range { low: 2, high: 5 },
                        body: Print { args: ["i"] } },
                    If {
                        condition: Eq { left: "a", right: 2 },
                        truePart: Break { } } ] } },
            falsePart: Assign { target: "p", source: Cos { args: [ 5 ] } } },
        Return { value: Divide { left: "a", right: 5 } } ] } }

Indentation

Okay, let’s move away from direct representations and use some “real syntax.” By real syntax we mean creatively arranging symbols in a pleasing and readable way but still having it be possible to unambiguously recapture the tree structure. How about indenting? (The technical term for this approach is the Off-side rule.)

function f(x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            if a == 2
                break
    else
        p = cos(5)
    return a / 5
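Recovering the tree from indentation is mostly a matter of tracking which blocks are still open. The sketch below is one simple way to do it in Python, under several assumptions: indentation uses spaces only, each line is kept as uninterpreted text, and parse_indented is a made-up name. A real parser would also attach an else line to its if rather than treating it as a sibling.

```python
def parse_indented(source):
    """Turn indented lines into a tree of (text, children) pairs."""
    root = ("<root>", [])
    stack = [(-1, root)]  # (indent width, node) for each block still open
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no structure
        indent = len(line) - len(line.lstrip(" "))
        node = (line.strip(), [])
        # Close every block indented at least as far as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1][1].append(node)  # attach to the innermost open block
        stack.append((indent, node))
    return root[1]

source = """\
while found
    for i in 2..5
        print i
    if a == 2
        break
"""

tree = parse_indented(source)
while_node = tree[0]
print(while_node[0])                          # while found
print([child[0] for child in while_node[1]])  # ['for i in 2..5', 'if a == 2']
```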

Curly Braces

The use of curly braces to define subtrees is, for some reason, very popular.

function f(x, y) {
    var b = 3;
    if x < a * (b + 7) {
        while found {
            for i in 2..5 {
                print i;
            }
            if a == 2 {
                break;
            }
        }
    } else {
        p = cos(5);
    }
    return a / 5;
}

Terminating Keywords

Rather than marking the beginning and end of a block, as is done with the curly brace style and its equivalents, we can get away with just marking the end. It’s true! This can be done simply with the word end:

function f(x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            end
            if a == 2
                break
            end
        end
    else
        p = cos(5)
    end
    return a / 5
end

or by spelling the initial word backwards (yes, laugh if you want but this is a real thing):

function f(x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            rof
            if a == 2
                break
            fi
        elihw
    else
        p = cos(5)
    fi
    return a / 5
noitcnuf

Postfix Style

A postfix style representation allows pretty simple evaluation on a stack machine. Here it’s crucial to distinguish symbols from variables—it’s pretty hard (impossible in general?) to distinguish a defining occurrence of an identifier from a using occurrence. We sometimes have to bundle code blocks; here I used square brackets for that purpose:

[ :x param :y param ]
[
    3 :b var
    x a b 7 + * <
    [
        found
        [
            :i 2 5 [i print] for
            2 a == [break] [] if
        ]
    while]
    [5 cos p assign]
    if
    a 5 / return
] :f function

How do you make sense of this? Evaluate the code directly on a stack. When you see a value, including a block, push it on the stack. When you see an operator, apply it to the element(s) on the stack top.
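As a starting point, here is a Python sketch of that evaluation rule for the expression subset only (numbers, variable references, and binary operators). Blocks, symbols, and control words such as if and while are deliberately left out, and eval_postfix and BINARY_OPS are names invented for this sketch.

```python
import operator

BINARY_OPS = {
    "+": operator.add,
    "*": operator.mul,
    "/": operator.truediv,
    "<": operator.lt,
    "==": operator.eq,
}

def eval_postfix(source, env):
    """Evaluate a whitespace-separated postfix expression using a stack."""
    stack = []
    for token in source.split():
        if token in BINARY_OPS:
            right = stack.pop()  # operands come off in reverse order
            left = stack.pop()
            stack.append(BINARY_OPS[token](left, right))
        elif token.isdigit():
            stack.append(int(token))   # a literal value: push it
        else:
            stack.append(env[token])   # a variable: push its current value
    return stack.pop()

# x < a * (b + 7) with x = 1, a = 2, b = 3
print(eval_postfix("x a b 7 + * <", {"x": 1, "a": 2, "b": 3}))  # True
```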

Exercise: If you’ve ever taken a compilers class, or written a compiler or interpreter for fun, write an interpreter for the little “language” above.

Sugary Functional Style

Did you know it’s possible for everything in a language to be the composition of function calls, and yet look like it has statements?

function f(x, y) =
    let
        b = 3
    in
        if x < a * (b + 7) then
            while found do (
                for i in 2..5
                    print i
                ;
                if a == 2 then
                    break
                else
                    nil
            )
        else
            p = cos(5)
        ;
        a / 5
    end

Delimiters

How to separate one construct from another is a really big issue in syntax design, believe it or not. We can identify two main classes of languages: those in which newlines are significant and those in which they are not.

“Insignificant” Newlines

In many languages, newlines are just like any other whitespace character (except for minor exceptions such as single-line comments and single-line string literals). Then, unless you have an S-Expression-based syntax as in LISP, Scheme, and Clojure, you’ll need semicolons to terminate (or separate) statements. This means you can (but shouldn’t) write code like:

#define ZERO 0
    unsigned  gcd(   unsigned   int  // Euclid's algorithm
      x,unsigned   y) {   while ( /* hello */  x>   ZERO
   ){unsigned temp=x;x=y   %x;y  = temp ;}return

   y ;}

“Significant” Newlines

Where you place your newlines matters greatly in, let’s see, Assembly languages, Python, Ruby, JavaScript, CoffeeScript, Elm, Haskell, Go, Swift, and yes, many others. The rules can get pretty technical.

Python scripts are defined as sequences of logical lines, delimited by the token NEWLINE. A statement may not cross logical lines, except in the case of compound statements in which each constituent simple statement ends with a NEWLINE. Logical lines are made up of one or more physical lines according to line joining rules. Lines are implicitly joined within parentheses, brackets, or braces; lines can be explicitly joined by ending with a backslash. These rules are somewhat exclusive of comments and string literals.
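The joining rules are easiest to see in a tiny example. The variable names here are arbitrary; the point is only which physical lines end up in the same logical line.

```python
# Implicit joining: the open parenthesis keeps the logical line going.
total = (1 +
         2 +
         3)

# Explicit joining: a trailing backslash joins the next physical line.
also_total = 1 + \
             2 + \
             3

# With neither form of joining, each physical line is its own statement.
partial = 1
+ 2  # a separate expression statement (unary plus); its value is discarded

print(total, also_total, partial)  # 6 6 1
```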

Ruby looks at the end of each line and says “well if up to here it looks like we’ve completed a statement then we have.” This means you have to be careful where you break lines:

puts 5
  + 3
puts 5 +
  3

prints 5 then 8.

Exercise: Why?

“Possibly Significant” Newlines

JavaScript requires most statements to be terminated by semicolons, but the compiler will put one in for you if it looks like you might have missed one. The rules by which this automatic semicolon insertion (ASI) is done have to be learned and they might be hard to remember.

[God creating JavaScript]
GOD: It uses prototype-based inheritance.
Angel: Nice.
GOD: Also it secretly adds semicolons to ur code.
A: wat
— Neckbeard Hacker (@NeckbeardHacker) August 24, 2016

If you are going to be a serious JavaScript programmer, you need to learn the rules of ASI whether you choose to use semicolons or not.

Exercise: Research the famous Rules of Automatic Semicolon Insertion. Which statements are supposed to be terminated by a semicolon? When is a semicolon inserted? Give four examples of how writing JavaScript in a "free-form" manner is impossible because of semicolon insertion.

Exercise: Get your ASI Certification

Some people feel very strongly about whether or not to use semicolons.

Function Calls

Questions to ask when invoking functions: Exactly one argument, or zero or more arguments? Parens or no parens? Positional or keyword arguments? If no arguments, can we omit parens then? Arguments first or last?

Let's play around and see what we come up with:

    push(myStack, 55)
    push myStack 55
    push(on: myStack, theValue: 55)
    push(theValue: 55, on: myStack)
    push on:myStack theValue:55
    [push myStack 55]
    (push myStack 55)
    push({on: myStack, theValue: 55})
    push {on: myStack, theValue: 55}

    sum (filter even (map square a))
    sum $ filter even $ map square $ a
    sum <| filter even <| map square <| a
    a |> map square |> filter even |> sum
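The last group differs only in how nesting is written, not in what gets called. Python has no pipe operator, so the sketch below fakes one with a hypothetical helper named pipe; square_all and keep_even are likewise made-up stand-ins for map square and filter even.

```python
from functools import reduce

def pipe(value, *functions):
    """Feed value through each function in turn, left to right."""
    return reduce(lambda acc, fn: fn(acc), functions, value)

def square_all(xs):
    return [x * x for x in xs]

def keep_even(xs):
    return [x for x in xs if x % 2 == 0]

a = [1, 2, 3, 4]

nested = sum(keep_even(square_all(a)))       # innermost call runs first
piped = pipe(a, square_all, keep_even, sum)  # same calls, written in execution order

print(nested, piped)  # 20 20
```

The nested form reads inside out, while the piped form reads in the order the work happens; which one is clearer is exactly the kind of question a syntax designer has to answer.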

On the other side of calls are function definitions. Perhaps different styles of definition can force a particular style of call:

    def sqrt(x, /)
    def line(*, x1, x2, y1, y2, width, style, color)

Or perhaps both at the same time?
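Python (3.8 and later) does allow both markers in one signature. In this sketch the function push and its parameter names are invented for illustration: everything before the slash must be passed positionally, and everything after the star must be passed by keyword.

```python
def push(stack, /, *, value):
    """stack is positional-only; value is keyword-only."""
    stack.append(value)
    return stack

my_stack = []
push(my_stack, value=55)  # the only call shape this definition accepts
print(my_stack)           # [55]

# Both of these other call styles are rejected before the body runs.
bad_calls = (
    lambda: push(my_stack, 55),               # value passed positionally
    lambda: push(stack=my_stack, value=55),   # stack passed by keyword
)
for bad_call in bad_calls:
    try:
        bad_call()
    except TypeError as error:
        print("rejected:", type(error).__name__)
```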

Syntactic Sugar

Syntactic sugar refers to forms in a language that make certain things easier to express, but can be considered surface translations of more basic forms.

This is best understood by example. There are zillions of examples out there. Here are a few. (Disclaimer: Some of these are just examples I made up and are not part of any real language.)

Construct	Desugared Form	Description
`x += n`	`x = x + n`	Compound assignment
`a + b`	`operator+(a, b)` or `"+"(a, b)` or `__add__(a, b)`	Common in languages that allow overloading
`a[i]`	`*(a + i)`	(C, C++ pointer arithmetic) And `i[a]` works too!
`p -> x`	`(*p).x`	(C, C++) Field of struct being pointed to
`f`	`f()`	Some languages let you leave off parentheses in calls with no arguments
`f x`	`f(x)` or `x.f()`	Some languages let you leave off parentheses in calls with one argument
`x op y`	`op(x, y)` or `x.op(y)`	Some languages let you leave off parentheses in calls with two arguments
`let x=E1 in E2`	`(x => E2)(E1)`	Let-expression (in functional languages)
`(E1 ; E2)`	`(() => E2)(E1)`	Expression sequencing (in eager functional languages)
`r = [s for x in a if e]`	`r = [] for x in a: if e: r.add(s)`	List comprehension
`x orelse y`	`if x then x else y`	(Standard ML) short-circuit disjunction
`x andalso y`	`if x then y else x`	(Standard ML) short-circuit conjunction
`[x, y, z]`	`x :: y :: z :: nil`	Lists in Standard ML
`"a${x}b"`	`"a" + x + "b"`	String interpolation
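A few of these rows can be checked directly in Python. Treat this as an approximation rather than a statement of how Python is specified: the real rules for + also involve reflected methods such as __radd__, and a list comprehension has its own scope, so the desugared forms below match in result only.

```python
import operator

# a + b as a function or method call
a, b = 3, 4
assert a + b == operator.add(a, b) == a.__add__(b)

# List comprehension versus the explicit loop it abbreviates
items = [1, 2, 3, 4]
sugared = [x * x for x in items if x % 2 == 0]

desugared = []
for x in items:
    if x % 2 == 0:
        desugared.append(x * x)

assert sugared == desugared == [4, 16]

# String interpolation versus concatenation
name = "x"
assert f"a{name}b" == "a" + name + "b"

print("all three sugared forms match their desugared forms")
```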

Exercise: Find some more examples.

When the sugared form is completely gratuitous or actually makes the code less readable, you sometimes hear the term syntactic syrup or syntactic saccharin.

Syntactic Salt

Here’s the definition from The New Hacker’s Dictionary:

The opposite of syntactic sugar, a feature designed to make it harder to write bad code. Specifically, syntactic salt is a hoop the programmer must jump through just to prove that he knows what’s going on, rather than to express a program action. Some programmers consider required type declarations to be syntactic salt. A requirement to write “end if”, “end while”, “end do”, etc. to terminate the last block controlled by a control construct (as opposed to just “end”) would definitely be syntactic salt. Syntactic salt is like the real thing in that it tends to raise hackers’ blood pressures in an unhealthy way.

Compactness and Verbosity

Some people love a very verbose syntax, where you say everything, because explicit is better than implicit. Some people love very terse syntax, as there is less cognitive load and less noise in the code. Please be reasonable, though. There is such a thing as code that is too terse, and such a thing as code that is too verbose.

Terseness

Some languages pride themselves on doing a whole lot with few characters:

An example from Ruby (do you see what this does?):

c = Hash.new 0
ARGF.each {|l| l.scan(/[A-Z']+/i).map {|w| c[w.downcase] += 1}}
c.keys.sort.each {|w| puts "#{w}, #{c[w]}"}
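For comparison, here is a deliberately less terse Python sketch of what appears to be the same computation: counting case-insensitive words and listing them in sorted order. To stay self-contained it takes a list of lines rather than reading files or standard input as ARGF does, and word_counts is a made-up name.

```python
import re
from collections import Counter

def word_counts(lines):
    """Count words (runs of letters and apostrophes), ignoring case."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[A-Za-z']+", line):
            counts[word.lower()] += 1
    return counts

counts = word_counts(["The cat saw the dog", "Don't wake the cat"])
for word in sorted(counts):
    print(f"{word}, {counts[word]}")
```

The Python version spends more characters on names and structure; whether that is clarity or noise is the trade-off this section is about.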

An example from APL (The 99 bottles of beer program taken from Rosetta Code):

bob  ←  { (⍕⍵), ' bottle', (1=⍵)↓'s of beer'}
bobw ←  {(bob ⍵) , ' on the wall'}
beer ←  { (bobw ⍵) , ', ', (bob ⍵) , '; take one down and pass it around, ', bobw ⍵-1}
↑beer¨ ⌽(1-⎕IO)+⍳99

Here’s APL again, with an expression to find all the prime numbers up to R:

(~R∊R∘.×R)/R←1↓⍳R
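Read right to left, that expression builds the candidates 2 through R, forms every pairwise product, and keeps the candidates that are not among the products. A rough Python rendering follows, assuming APL's default index origin of 1; primes_up_to is a name chosen for this sketch.

```python
def primes_up_to(n):
    """Keep the candidates that are not a product of two candidates."""
    candidates = range(2, n + 1)                                # R←1↓⍳R
    products = {a * b for a in candidates for b in candidates}  # R∘.×R
    return [r for r in candidates if r not in products]         # (~R∊...)/R

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Like the APL original, this favors brevity over efficiency: it computes on the order of n squared products just to throw most of them away.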

Candygrammars

Sometimes you will encounter languages that tend to look like a natural language. What do you think about this?

An example in Hypertalk (taken from Wikipedia):

on mouseDown
  answer file "Please select a text file to open."
  if it is empty then exit mouseDown
  put it into filePath
  if there is a file filePath then
    open file filePath
    read from file filePath until return
    put it into cd fld "some field"
    close file filePath
    set the textStyle of character 1 to 10 of card field "some field" to bold
  end if
end mouseDown

An example from Manatee:

to get the truth value prime of whole number n:
    return no if n < 2
    for each d in 2 to n - 1:
        return no if d divides n
    end
    return yes
end
for each k in 1 to 100:
    write k if prime(k)
end

In practice this kind of verbosity is worse than it sounds. Here’s what the New Hacker’s Dictionary has to say about this:

candygrammar /n./ A programming-language grammar that is mostly syntactic sugar; the term is also a play on “candygram.” COBOL, Apple’s Hypertalk language, and a lot of the so-called “4GL” database languages share this property. The usual intent of such designs is that they be as English-like as possible, on the theory that they will then be easier for unskilled people to program. This intention comes to grief on the reality that syntax isn’t what makes programming hard; it’s the mental effort and organization required to specify an algorithm precisely that costs. Thus the invariable result is that candygrammar languages are just as difficult to program in as terser ones, and far more painful for the experienced hacker.

[The overtones from the old Chevy Chase skit on Saturday Night Live should not be overlooked. This was a "Jaws" parody. Someone lurking outside an apartment door tries all kinds of bogus ways to get the occupant to open up, while ominous music plays in the background. The last attempt is a half-hearted "Candygram!" When the door is opened, a shark bursts in and chomps the poor occupant. There is a moral here for those attracted to candygrammars.]

Moving On

We’ve looked at questions such as these, that a language designer might address:

Is code expressed in terms of blocks? Visual graphs? Text?
If text, which characters are allowed? Is whitespace significant? Is case significant?
Is there a lot of punctuation? Symbols? Or is it mostly words? If words, what natural language are you biased toward?
How do we do comments? Pragmas? Directives?
Do we express structure with indentation, curly-braces, terminating-end, or nested parentheses?
Should operations have a functional (length(a)) or message-oriented (a.length) feel?
What kind of shorthands and idioms should be made available?
How do program units map to files or other physical containers?

But we’ve only scratched the surface. We haven’t considered higher level topics like having different syntactic modes, or allowing customization of syntax, or whether such ideas would even do more harm than good. What other questions can you ask?