So, you want to design your own language? Of course you do. Or perhaps you are taking a class and are being forced to create a programming language under penalty of a bad grade. What kinds of things do you need to know?
Prerequisites
It helps to be experienced. But it’s okay if you’re not—you can get lucky, too!
You should have a good sense of:
Programming Paradigms: imperativedeclarativestructuredobject-orientedfunctionalapplicativeconcatenativelogicprotocol-orientedaspect-orientedarrayevent-drivendataflowagent-based etc.
Remember, many people have designed languages before you. They made mistakes. They came up with brilliant ideas. Many were wildly successful. Some never made it big. Some people have brought in years of research on how people think and learn to come up with principles for language (and environment) design.
You should learn form their experiences. Study classic papers. Read web essays. Here is a small sampling:
Ready to strike out on your own? Do the following, in no particular order:
Understand the audience that the language is designed for, and what kinds of things they want to create with it, or problems they want to solve with it.
Sketch programs or program fragments in your language.
Sketch fragments that are not in your language.
Come up with a list of semantic capabilities, or features. Make sure they enable programmers to express their creations by following the suggestions and principle in the Learnable Programming essay, including:
Make meaning transparent
Explain in context
Make flow and time tangible and visible
Show the data, and avoid hidden state
Use metaphor
Allow easy decomposition and recomposition
Allow the programmer to write readable code—you do support mentioning parameter names in calls, right?
Come up with a nice, consistent, syntactic theme.
Then:
What will you call your language, and why?
Remember that design is an iterative process, and creativity is enabled and enhanced with immediate feedback. So you should design with tools that allow you to experiment and test your ideas.
In the upper left panel, design your grammar. You can load/save from your browser’s local storage, and even publish gists to GitHub. In the upper right panel, enter test cases: both tests you want to succeed (thumbs up) and those you want to fail (thumbs down). The bottom panel is an interactive concrete syntax tree for the currently selected test case.
This tool will save you a lot of time.
Things To Think About
Okay I’m just going to rattle off a bunch of this that come into my head that you may or may not want to think about when you design your language:
Are you trying to make a reasonable, usable language or an esoteric/joke/golfing/weirdo language?
Do you have a specific audience in mind? Artists? Graphic designers? AI researchers? Numeric and Scientific nerds? Natural language types? Game developers? Machine learning people? Animators? High performance folks? System programmers? Or do you want a general purpose language? Or do you just want to do what you want?
Do you want to be pragmatic, idealistic, researchy, or evil?
Do you want your language firmly in one camp—OO, functional, logic, concatenative, plain-imperative—or be a multiparadigm symphony? Or a multiparadigm cacophany?
Do you want a language built on a single characteristic building block (a crystallization of style) or one with a huge variety of syntactic forms for multiple semantic aspects (an agglutination of features)?
What is your scheme for showing structure? Curly-brace (JavaScript, Java, C++, C#), Terminal-end (Ruby, Ada), Nested parentheses (Lisp, Clojure, Racket), Indentation (Python, CoffeeScript), Blocks (EToys, Scratch, Snap!), Pictures (Piet), Other (Haskell, Erlang, Prolog)?
If your language supports functions:
Must you pass the exact number of arguments the function expects?
Can you specify parameter names in the call?
Do you have rest parameters? optional parameters? keyword parameters? default parameters?
Are the parameters typed?
Can parameters be modified? Are they passed by value, value-result, reference, name, etc?
Are arguments evaluated L-R, R-L, in parallel, or in arbitrary order?
How does a function return a value? Can it return zero or more values or just one?
Are functions first-class? If so, do you use deep binding or shallow binding?
Is recursion allowed?
Can you test functions for equality?
Can functions be anonymous?
What is your type system like?
Static vs. dynamic, strong vs. weak, manifest vs. implicit?
Do you distinguish primitive types from reference types?
Do you have all those different numeric types based on bit size? What about decimals and ratios?
Do you have a separate character type? A separate boolean type?
Can you get the type of an expression at runtime?
Are types objects?
Do you have supertypes and subtypes? Multiple-inheritance? If so, how do you handle name collisions?
Are types the same as classes? Can you add new types?
Do you have both sum and product types?
Do functions have a type?
Are there pointer types?
Are there parameterized types? If so, are they (mostly) covariant, contravariant, or invariant?
Any dependent types?
What do your expressions look like?
And does your runtime follow eager or lazy evaluation? Or both?
Do you have only prefix, only postfix, only infix, or a wild mix of operators?
Can you overload operators? If so, how?
Can you change the precedence of operators? Even the built-in ones? At runtime?
How much type inference do you have?
Can you mark variables as mutable or immutable?
Are your variables bound or assigned? Can they be assigned more than once?
Do you have destructuring and/or pattern matching?
How do you determine scope? Do you implicitly or explicitly import into inner scopes?
Are there keywords that define access like public, private, and protected, or are there conventions, like in Go, where capitalized entities are implicitly exportable and lower-cased entities are private?
Is there a let-expression for super-local scopes?
Is shadowing allowed?
Do you have anything like JavaScript’s this expression?
How do you express control flow?
Do you like expression-orientation or do you have real statements?
How do you express sequential flow vs. nondeterministic flow vs. parallel flow?
If you have nondeterminism, how do you ensure fairness?
Must guards in a multiway selection be executed in any particular order?
Do you have short-circuit operators? Iterator objects? Generators?
Do you have all the loops: (1) forever, (2) n times, (3) while/until a condition, (4) through a numeric range (with an optional step size), or (5) through a collection?
Do you have break and continue? Anything like Ruby’s retry and redo?
Do you have exceptions? Or do you need Go-style constructs? Or do you have nullables everywhere?
Do you have a timer for sleeping, delaying execution, or running on intervals?
Can I haz goto?
How do you support concurrency?
Threads, or events?
Shared memory only? Message passing only? Or both? If shared memory, is it mutable? If so, do you have locks, higher-level synchronization devices, or some kind of transactional memory?
Do you support different levels of granularity of concurrency?
Are tasks spawned implicitly when their enclosing block begins, or do they require explicit invocation?
Do tasks die when the enclosing block or spawning task terminates? Or does the spawner wait for all internal tasks to terminate?
Do taks know who started them?
If an asynchronous task is launched, how is a result obtained? Callback, promise, or some other mechanism?
Is message passing done via named channels, named processes, or bulletin board?
Is message passing always synchronous, asynchronous, or both?
Is there timed or conditional synchronization mechanisms?
Can tasks detect their own state?
How meta are you?
Can a program get a list of its global variables, loaded classes, top-level functions, etc.? Can a class get a list of its fields, methods, constructors, superclass(es), subclasses, etc.? Can a function get its parameters, locals, return types, etc.?
Can a function find out who called it?
Can a variable be read or set via its name (a string) only? Can a function or method be invoked by its name (a string) only? Can new variables or types or functions be created at run time, given only their name and some indication of how they should look?
Does the language have a macro system? If so, is it string-based or AST-based?
Is it possible to unquote within a macro?
Are macros hygienic? What syntactic devices are provided to explicitly capture in- scope variables, if any?
[God creating JavaScript] GOD: It uses prototype-based inheritance. Angel: Nice. GOD: Also it secretly adds semicolons to ur code. A: wat— Neckbeard Hacker (@NeckbeardHacker) August 24, 2016
Theoretical Concerns
Moving on from our particular example, let’s look at things from a more academic perspective and consider real-life programming language definitions. There seem to be three ways to produce a language definition:
An official document, with a mix of formal notation and informal descriptions.
An official document, with 100% of the definition specified in a formal notation.
A reference implementation, namely a compiler, so that the language definition is “whatever this program does.”
Some language definitions are sanctioned by an official standards organization (like ISO, IEC, ECMA, ANSI, etc.) while some don’t even care about standardization.
Exercise: Create a bibliography of the official standards for the major programming languages.
Usually a language is defined by considering its:
Syntax
Structure
Semantics
Pragmatics
Pragmatics
Usage
Syntax
We have a lot of options for defining a syntax:
CFG (Context Free Grammar)
BNF (Backus-Naur Form)
EBNF (Extended Backus-Naur Form)
ABNF (Augmented Backus-Naur Form)
Syntax Diagrams (a.k.a. Railroad Diagrams)
PEG (Parsing Expression Grammar)
Ohm Grammar (variation on the classic PEG)
The first five forms are all equivalent. They describe exactly the class of context-free languages. PEGs capture a different set of languages, including some context-sensitive languages like $a^nb^nc^n$.
It turns out general parsing of CSGs is hard or inefficient. So we normally give a grammar for the context-free parts and leave the context-sensitive parts, like “variables must be declared before being used,” to the semantics.
Exercise: Find out if there are any known complexity bounds (upper or lower) for parsing a context sensitive language.
Syntax is usually (but not always) divided into:
A microsyntax, specifying how the characters in the source code stream are grouped into tokens. The microsyntax deals with whitespace, comments, and case sensitivity.
A macrosyntax, speficying how the tokens are grouped into phrases (such as expressions, statements, declarations, etc.)
An abstract syntax, which is a much-simplified restructuring of the macrosyntax.
Semantics
A language’s semantics is specified by mapping its syntactic forms (often abstract syntax tree fragments) into their meaning. Common approaches include:
Natural Language (informal)
“Compile it and run it” (in which the compiler itself defines the language, and thus the compiler correctness problem is trivial—the compiler is by definition always correct)
Denotational Semantics
Operational Semantics
Axiomatic Semantics
Action Semantics
A hugely important distinction is that between:
Static Semantics: which deals with legality rules—things you can check before running the code (compile time), and
Dynamic Semantics: which deals with the execution behavior; things that can only be known at runtime.
Example: You’ve probably heard the distinction between “static” and “dynamic” before. Recall that a statically-typed language in which type checking is done prior to program execution and a dynamically typed language is one in which type checking is done during program execution. Most languages do a little of both, but one or the other usually predominates.
Pragmatics
Pragmatics does not affect the formal specification of programming languages. However, pragmatic concerns must guide your design of a programming language, if you want it to be easy to read, easy to write, and able to be implemented efficiently. Pragmatics encompasses:
Common programming idioms (the right ways and the wrong ways of doing things)