The field we now call automata theory arose from many different areas of research, including:
These folks all came up with the notion of an automaton as a kind of machine, real or abstract, that runs by itself (hence the prefix “auto”) to process some information. There should be no limits to your imagination regarding how an automaton should or should not look, nor what kinds of internal operations it carries out (as long as it is finitely describable).
A brief history of the field can be found at Wikipedia.
We want a rigorous theory of automata. Therefore, we will make the following assumptions about the automata that we study:
The first assumption is reasonable because a machine that requires an infinite description could not be built, let alone understood. The second assumption is reasonable for reasons we discussed previously in our notes on language theory.
We can give a picture of an automaton as follows:
For computations with multiple inputs, we can make a single string with a special separator symbol.
Why create and study a theory of automata? Two reasons:

Here’s how we’ll proceed:
We will look at examples first, because knowledge advances by people making things before they generalize what they have done into theories.
Turing Machines are so historically significant and influential that I have an entire separate page of notes for them, including how Turing came up with the idea and design. For now, let’s just look at the basics.
In a TM, we begin with the input sitting on an infinite tape of cells surrounded by blanks. The machine has a head pointing to the first symbol of the input. The machine’s control is a finite set of rules that says “If in state $q$ looking at symbol $\sigma$, then (1) optionally replace the current symbol with a different symbol $\sigma'$, (2) optionally move either one cell right or left, and then (3) optionally transition to a different state $q'$.” The machine computes by following the rules one after the other. When it gets to a configuration in which there is no rule for the current state and currently scanned symbol, it halts.
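The rule-following loop just described is small enough to sketch in code. Here is a minimal Python simulation (the dictionary encoding of rules is my own, not any standard notation), using a binary incrementer as the rule set:

```python
def run_tm(rules, tape, state, blank="#"):
    """Run a Turing Machine until no rule matches the current
    (state, symbol) pair. `rules` maps (state, symbol) to
    (symbol_to_write, move, next_state), where move is -1, 0, or +1."""
    tape = dict(enumerate(tape))   # sparse tape: cell index -> symbol
    head = 0
    while (state, tape.get(head, blank)) in rules:
        write, move, state = rules[(state, tape.get(head, blank))]
        tape[head] = write
        head += move
    lo, hi = min(tape), max(tape)
    return "".join(tape.get(i, blank) for i in range(lo, hi + 1)).strip(blank)

# The incrementer: scan right to the end of the input, then flip
# trailing 1s to 0s and the first 0 (or blank) to 1.
INCREMENT = {
    ("S", "0"): ("0", +1, "S"),
    ("S", "1"): ("1", +1, "S"),
    ("S", "#"): ("#", -1, "F"),
    ("F", "1"): ("0", -1, "F"),
    ("F", "0"): ("1", 0, "DONE"),
    ("F", "#"): ("1", 0, "DONE"),
}
```

Calling `run_tm(INCREMENT, "10111", "S")` follows the rules exactly as described: step after step until the machine gets stuck, here in the `DONE` state.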

We’ll cover Turing Machines in detail later, but for now here’s an example Turing Machine to increment a binary number:

Here is the machine in action, incrementing 10111 (the state letter appears at the left, and the bracketed cell is the one under the head):

```
S   # #  [1]  0   1   1   1   #  #
S   # #   1  [0]  1   1   1   #  #
S   # #   1   0  [1]  1   1   #  #
S   # #   1   0   1  [1]  1   #  #
S   # #   1   0   1   1  [1]  #  #
S   # #   1   0   1   1   1  [#] #
F   # #   1   0   1   1  [1]  #  #
F   # #   1   0   1  [1]  0   #  #
F   # #   1   0  [1]  0   0   #  #
F   # #   1  [0]  0   0   0   #  #
I   # #   1  [1]  0   0   0   #  #
```
A variation on the Turing Machine is to make each action be comprised of a mandatory print and a mandatory move. In that variation, the state diagram for our binary number increment machine would be written like so:

The mandatory move part is a little annoying, since there is often a useless move at the end of a computation (as in the above example). If it is too annoying, then add to L and R the value N, for “no move” or “neutral”.
**Get used to multiple variations and notations**
It turns out there are way more than two variations of the Turing Machine, and the notations can vary quite a bit. They all have in common that “when in state $q$ looking at $\sigma$, optionally perform actions, then optionally transition to a new state”.
These notes will use both transition notations (e.g., `0 → R` for zero-or-more actions at a time and `0 0 R` for exactly-one-write-and-one-move), so get used to reading both. They are equivalent in computing power.
Generally, when the machine halts, the result of the computation is whatever is left on the tape (between the infinite runs of blanks). But sometimes, we want a computation that just answers a YES-NO question. In these cases, it is super super convenient to not have to actually write YES or NO, but instead introduce the notion of an accept state. Stopping in an accept state means the answer is YES. Such a machine is called a recognizer because it just recognizes which inputs produce a YES answer. A machine intended to produce actual output data is called a transducer. Here is a recognizer to determine if its input, a binary number, is divisible by three:

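The state diagram for this recognizer isn’t reproduced here, but the trick behind it is worth spelling out: keep the remainder mod 3 in the state, since reading one more bit doubles the number and adds the bit. This Python version is my own illustration of that idea (note that it never writes to the tape at all):

```python
def divisible_by_3(bits):
    """Recognizer sketch: the state is the remainder mod 3.
    Reading bit b takes remainder r to (2*r + b) % 3, because
    appending a bit doubles the number and adds b."""
    r = 0
    for b in bits:
        r = (2 * r + int(b)) % 3
    return r == 0   # accept iff the remainder is zero
```

Since no writes and no left-moves are needed, this is a hint that the language is actually recognizable by a much weaker machine than a TM, as we’ll see shortly.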
Jade has a little explainer on Turing Machines (in which she uses the notation for the exactly-one-print, exactly-one-move model):
Watching videos can be super helpful, but you’ll want to follow up with practice.
Pretend you are a Turing machine. Painstakingly work out the computations of incrementing the binary number 11010011111, and determining whether 11010011111 is divisible by three. Write out each “configuration” (state and tape contents) you pass through.
We’ll see later that this “single-tape” Turing Machine is, as far as we know, capable of computing all computable functions. Its simplicity makes it the go-to model for studying computation! But for convenience, we can extend the model with multiple tapes! This doesn’t add any more computing power: you can show that any TM with multiple tapes can be simulated with a 1-tape TM.

Single and multi-tape Turing Machines have equal computing power. Thanks to their infinite tapes, they have a lot of power. What if we bound the tape?
A Linear Bounded Automaton, or LBA, is essentially a Turing Machine whose tape is not infinite but rather is bounded to the size of its input plus two additional read-only cells: one to the left of the input containing a non-replaceable “left-end-marker” symbol and one to the right containing a non-replaceable “right-end-marker” symbol. The machine is not allowed to move beyond these markers, nor is it allowed to write the marker characters anywhere else on the tape. It may, of course, freely write any non-marker symbol between the markers.

It feels a little more realistic, closer to a real computer, because it has a finite “memory,” right?
Turing Machines and LBAs can move back and forth across their tapes. There’s an interesting restriction you can make that requires you to read through the input exactly once from left to right only, and only use a stack for working memory. Why might this be nice? Stack push and pop, you may remember, are $\Theta(1)$ (constant) time operations.
A Pushdown Automaton, or PDA, is essentially a Turing Machine with (1) a read-only input tape with no left moves allowed, (2) a separate unbounded stack that is initially empty, and (3) optionally, an append-only output tape. The transition rules always take into account the current state, but may also consider the current input symbol (in which case the input head is advanced to the right) and may also consider the symbol on the top of the stack (in which case it is popped). The action fired by a rule is to (optionally) push symbols on the stack, (optionally, if a transducer) append symbols to the output tape, and (optionally) move to a new state.

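To make the stack discipline concrete, here is a hypothetical deterministic PDA for the classic language $\{a^nb^n : n \geq 0\}$, sketched in Python. This particular machine is my own example, not one from the diagram above:

```python
def accepts_anbn(s):
    """Deterministic PDA sketch for { a^n b^n : n >= 0 }: push a
    marker for each 'a', pop one for each 'b', and accept iff the
    stack is empty when the input is exhausted."""
    stack = []
    popping = False                  # have we started reading b's?
    for c in s:
        if c == "a" and not popping:
            stack.append("A")        # push phase
        elif c == "b" and stack:
            popping = True
            stack.pop()              # pop phase
        else:
            return False             # wrong symbol, or out of order
    return not stack
```

Note how the single left-to-right pass plus the stack is all the memory this problem needs: the input head never backs up.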
Some problems don’t need any memory at all. Just zoom through the input one symbol at a time and be done!
A Finite Automaton or FA, is a Turing Machine with a read-only input tape and a head that moves one cell to the right at every step (there are no left-moves and no no-moves). Optionally, it may produce output on a separate write-only tape that is initially blank and only appended to.

**It’s FA, not FSM**
Sometimes you will hear folks call an FA a “finite state machine” or FSM for short. This is confusing because all automata have a finite number of states. Also, the acronym FSM is easily confused with the deity of Pastafarianism.
When describing FAs, we don’t have to worry about blanks, or say whether we are moving left or right! All we need is the input symbol and the output symbol(s). For a first example, here is a transducer FA that computes the one’s complement of its input:

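A simulation of this transducer is almost embarrassingly short. A sketch (a single-state machine that emits the flipped symbol on every transition; the diagram’s exact labels may differ):

```python
def ones_complement(bits):
    """One-state Mealy-style FA transducer: on input 0 emit 1,
    on input 1 emit 0, never changing state."""
    return "".join("1" if b == "0" else "0" for b in bits)
```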
For recognizers, we don’t have any output at all, so the machine transitions just have the input symbol. Here’s an FA to determine whether its input has three consecutive $a$’s somewhere:

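A sketch of this recognizer in Python, where the state simply counts how many consecutive $a$’s we have just seen (state 3 is an accepting trap state; the diagram’s state names may differ):

```python
def has_aaa(s):
    """FA recognizer: the state (0-3) counts consecutive a's just
    seen; once state 3 is reached, the machine stays there."""
    state = 0
    for c in s:
        if state < 3:
            state = state + 1 if c == "a" else 0
    return state == 3
```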
An FA that writes to its output on a transition is called a Mealy Machine. You can also make an FA transducer that outputs a symbol whenever it enters a state; this would be called a Moore Machine.
A queue automaton is a TM that, instead of moving left or right at each step, always dequeues the symbol at the left and enqueues some other string of symbols on the right.
We can mechanically compute by applying rules (from a finite set of rules) that allow us to rewrite part of a string into another. We start with the input surrounded by begin and end markers [ and ], and keep rewriting until there are no more rules to apply.
For example, here’s an instruction set for a rewrite machine that increments binary numbers:
Some example runs:
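Runs like these can be reproduced with a few lines of Python. Both the rule-application strategy (first listed rule, leftmost occurrence) and the rule set below are my own guesses at a marker-based incrementer, not necessarily the ones pictured above:

```python
def rewrite(s, rules):
    """Apply rewrite rules until none matches. For a deterministic
    run, always apply the first listed rule that occurs in the
    string, at its leftmost occurrence."""
    while True:
        for lhs, rhs in rules:
            i = s.find(lhs)
            if i >= 0:
                s = s[:i] + rhs + s[i + len(lhs):]
                break
        else:
            return s   # no rule applied: halt

# Guessed rules for incrementing [w]: the "]" marker sweeps left
# turning trailing 1s into 0s, until a 0 (or "[") absorbs the carry.
INC_RULES = [
    ("1]", "]0"),   # carry past a trailing 1
    ("0]", "1"),    # absorb the carry into a 0
    ("[]", "1"),    # the number was all 1s
    ("[",  ""),     # finally, drop the left marker
]
```

For example, `rewrite("[10111]", INC_RULES)` passes through `[1011]0`, `[101]00`, `[10]000`, `[11000`, and halts at `11000`.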
How about a formal definition?
A rewrite machine is a tuple $(\Sigma, R)$ where:
The one-step derivation relation $\Longrightarrow_R \;\subseteq \Sigma^* \times \Sigma^*$ is defined as: $xuy \Longrightarrow_R xvy$ iff $u R v$.
A rewrite machine $(\Sigma, R)$ computes output $w'$ from input $w$ iff $$[w] \Longrightarrow^*_R w' \wedge \forall\,x\,y\,z . w' = xyz \supset \neg \exists u . yRu$$
We can make a rewrite machine be a recognizer, too. Here is a rewrite machine that determines whether or not its input is in the form $a^nb^nc^n$, for $n \geq 0$:
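Here is one possible rule set, with a tiny simulator, that behaves this way. The walker symbols $D$, $E$, and $F$ are my own invention and may differ from the rules shown above; each sweep across the string consumes one $a$, one $b$, and one $c$:

```python
def apply_rules(s, rules):
    """Apply the first listed rule that occurs in the string (at
    its leftmost occurrence) until no rule matches, then halt."""
    while True:
        for lhs, rhs in rules:
            i = s.find(lhs)
            if i >= 0:
                s = s[:i] + rhs + s[i + len(lhs):]
                break
        else:
            return s

# Guessed rules: a walker (D, then E, then F) crosses the string,
# consuming one a, one b, and one c; accept when only the
# markers remain.
ANBNCN_RULES = [
    ("[]", "YES"),   # nothing left between the markers: accept
    ("[a", "[D"),    # consume one a, launch the walker
    ("Da", "aD"),    # walk right over remaining a's
    ("Db", "E"),     # consume one b
    ("Eb", "bE"),    # walk right over remaining b's
    ("Ec", "F"),     # consume one c
    ("Fc", "cF"),    # walk right over remaining c's
    ("F]", "]"),     # sweep finished; go around again
]

def accepts(w):
    return apply_rules("[" + w + "]", ANBNCN_RULES) == "YES"
```

On a malformed input the walker gets stuck with no rule to apply, so the string never reduces to `YES`.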
That might have been simpler than you expected, given the complexity of the analytic grammar, the generative grammar, and even the Turing Machine we saw for this language earlier!
Let’s extend the definition above:
A rewrite machine $(\Sigma, R)$ accepts $w$ iff $[w] \Longrightarrow^*_R \fbox{YES}$
Now let’s make something a bit less primitive and way more useful.
A register machine has a number of registers $r_0, r_1, r_2, \ldots$, each holding a single (unbounded) natural number. The machine’s control is a program, defined as a list of (labeled) instructions. The instructions are normally executed in sequence, but some instructions can cause a “jump” to an arbitrary instruction. The instruction set varies from model to model (of which there exist an infinite number of possibilities). Here’s an example of a register machine with a very small instruction set:
| Instruction | Description |
|---|---|
| $\textsf{load } r_k, n$ | $r_k := n$ |
| $\textsf{copy }r_i, r_j$ | copy contents of $r_j$ into $r_i$ |
| $\textsf{add } r_i, r_j, r_k$ | Store sum of contents of $r_j$ and $r_k$ into $r_i$ |
| ... and so on for $\textsf{sub}$, $\textsf{mul}$, $\textsf{div}$, $\textsf{pow}$, $\textsf{and}$, $\textsf{or}$, $\textsf{lt}$, $\textsf{le}$, $\textsf{eq}$, $\textsf{ne}$, $\textsf{ge}$, $\textsf{gt}$, $\textsf{shl}$, $\textsf{shr}$ | |
| $\textsf{neg } r_i, r_j$ | Store negation of the contents of $r_j$ into $r_i$ |
| $\textsf{not } r_i, r_j$ | Store bitwise negation of the contents of $r_j$ into $r_i$ |
| $\textsf{jump } n$ | Jump ahead (or back, if $n$ is negative) $n$ instructions |
| $\textsf{jz } r_i, n$ | Jump ahead (or back, if $n$ is negative) $n$ instructions if contents of $r_i$ is $0$ |
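A sketch of an interpreter for a subset of this instruction set, together with a program that computes $5! = 120$. The relative-jump semantics follow the table; clamping subtraction at zero is my assumption, since registers hold naturals:

```python
def run_rm(program):
    """Interpret a register-machine program: a list of tuples.
    jump/jz are relative to the current instruction, as in the
    table above; the machine halts when the program counter
    leaves the program."""
    r, pc = {}, 0                       # registers default to 0
    while 0 <= pc < len(program):
        op, *args = program[pc]
        step = 1
        if op == "load":   r[args[0]] = args[1]
        elif op == "copy": r[args[0]] = r.get(args[1], 0)
        elif op == "add":  r[args[0]] = r.get(args[1], 0) + r.get(args[2], 0)
        elif op == "mul":  r[args[0]] = r.get(args[1], 0) * r.get(args[2], 0)
        elif op == "sub":  # naturals only, so clamp at 0 (my assumption)
            r[args[0]] = max(0, r.get(args[1], 0) - r.get(args[2], 0))
        elif op == "jump":
            step = args[0]
        elif op == "jz" and r.get(args[0], 0) == 0:
            step = args[1]
        pc += step
    return r

# 5 factorial: r1 accumulates the product while r0 counts down.
FACT5 = [
    ("load", 0, 5),     # r0 := 5
    ("load", 1, 1),     # r1 := 1 (accumulator)
    ("load", 2, 1),     # r2 := 1 (constant one)
    ("jz", 0, 4),       # if r0 == 0, jump past the loop
    ("mul", 1, 1, 0),   # r1 := r1 * r0
    ("sub", 0, 0, 2),   # r0 := r0 - 1
    ("jump", -3),       # back to the jz test
]
```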
Register machines are rather important, as they are pretty much the basis for most “real” machines today. We’ll discuss them in more detail later, where we will connect them with modern machines with large and powerful instruction sets.
A stack machine is pretty much a register machine where the register operands are implicit: the register sequence is treated as a stack. We’ll see them in detail in the notes on register machines.
They are very popular because of their simplicity. In fact, the operation of the JVM and WebAssembly is based on stack machines.
Yes, people have created probabilistic automata and quantum automata, too.
Not to mention Büchi, Rabin, Streett, Parity, Muller, Analog, Continuous, and Cellular automata. There are no doubt others.
Oh, and do check out the amazing SECD Machine, designed to evaluate lambda calculus expressions. It’s a bit of a hybrid, containing components from different models.
Now that we’ve seen specific automata, we can generalize and come up with a useful vocabulary.
There are many dimensions on which we can characterize automata:
But as these notes are introductory, we will focus only on the first four dimensions.
There are different ways to specify the operation of an automaton. The main options are (1) state machines, (2) instruction sets, and (3) rewrite rules.
| If you are in this state | and this event occurs | Then perform this action | and move to state |
|---|---|---|---|
State machines can also be given with pictures:

If there can be more than one action/transition to take from a given state/input configuration, the machine is said to be non-deterministic, otherwise it is deterministic.
```
┌───────────────────┐
│   Instruction 0   │
├───────────────────┤
│   Instruction 1   │
├───────────────────┤
│   Instruction 2   │
├───────────────────┤
│         .         │
│         .         │
│         .         │
├───────────────────┤
│  Instruction n-1  │
└───────────────────┘
```
Register machines are by nature deterministic, but you can fake nondeterministic behavior by introducing an instruction that randomly chooses between multiple next instructions.
| If you see this substring | Rewrite with this substring |
|---|---|
The three operational variations are convenient, but none is really any more computationally powerful than the others. Each can simulate the others.
There’s a lot of variety in memory, but two primary kinds are:
─┬───┬───┬───┬───┬───┬───┬───┬───┬───┬─
• • • │ │ | | | | | | | | • • •
─┴───┴───┴───┴───┴───┴───┴───┴───┴───┴─
┌────────────────────┐
r0 │ │
├────────────────────┤
r1 │ │
├────────────────────┤
r2 │ │
├────────────────────┤
r3 │ │
├────────────────────┤
r4 │ │
├────────────────────┤
           ⋮
What other kinds of memory can you come up with?
One idea: multi-dimensional tapes, which would allow for, say, up, down, left, and right movements, and perhaps more. Use your imagination to build new computing machines.
Here’s an unnecessary distinction, but one that is super convenient to make.


If a recognizer machine $M$ says YES to input $w$, we say that $M$ accepts $w$.
A recognizer that always finishes and answers YES or NO (and never loops) is called a decider.
One final variation of interest, applicable to instruction-based automata only, really, is: where are the instructions stored? There are two main architectures:
Here’s a table summarizing the variations we’ve discussed:
| Dimension | Options | Description |
|---|---|---|
| Control | State machine | Control is a set of states with transitions |
| | Instruction list | Control is a list of instructions (which may contain jumps) |
| | Rewrite rules | Rules successively rewrite portions of the data |
| Memory | Tapes | Located by moving through the cells one at a time |
| | Registers | Located by register index |
| Output | Transducer | Produces an arbitrary output |
| | Recognizer | Outputs YES or NO only |
| Architecture | Harvard | Code is separate from the data |
| | Von Neumann | Code and data are stored together |
A recognizer automaton actually defines a formal language! That is, the language of recognizer machine $M$, or the language recognized by $M$, is:
$$ L(M) = \{ w \mid M \textrm{ accepts } w \} $$

Nice! This means a formal language can be defined not only by a grammar but also by a machine. This is philosophically interesting, if not profound.
The notions of “decision problem” and “language” are interchangeable.
Also, if $L$ is a language, then $M_L$ denotes a recognizer machine for $L$.
How about an example, to understand the important concepts here, and to get the vocabulary right? Let $M$ be the Turing Machine given by the following state diagram:

When run on any non-blank input of binary digits, the machine will end in the accept state iff the input ends with a 0. Otherwise it will get stuck. Therefore we say:
**Accepting vs. Deciding**
In the example above, the machine $M$ did indeed both accept and decide the language of even binary integers. However, there do exist languages that are recognizable by a TM but not decidable by any TM. The line between the acceptable and the decidable is part of computability theory, which we’ll study later.
It turns out that different kinds of (recognizer) machines can recognize different classes of languages. Here is a progression of machine types with strictly increasing recognition power:
| Automata | Languages that can be recognized |
|---|---|
| Loop-free FA | Finite Languages (languages with a finite number of strings). |
| FA | Regular Languages. You can recognize these without any external memory whatsoever. The only “memory” (if you can call it that) is where we are in the input and what state we are in. These languages are recognizable in linear time. |
| Deterministic PDA (DPDA) | Deterministic Context-Free Languages. An interesting subset of the context-free languages that show up in parsing theory. |
| Nondeterministic PDA (NPDA) | Context-Free Languages. Recognized by only using stack memory, which is super fast. |
| LBA | Type-1 Languages. Who knew? Bounding the tape of a TM reduces the computing power significantly! |
| TM that always halts | Decidable Languages. Also known as the Recursive Languages, $R$. |
| TM | Type-0 Languages. Recursively Enumerable Languages, $RE$. For languages that are in $RE$ but not in $R$, a TM can be found that will accept (say YES) every string in the language, but will not be able to halt and say NO for all strings not in the language. |
DeterminismMachines that allow more than one possible operation from a given configuration are called nondeterministic, and those that allow only one are called deterministic. Okay, fine, but here’s the philosophical question: is nondeterminism more “powerful” than determinism? That is, can we recognize more languages with nondeterministic machines of a given class than deterministic ones? It turns out:
- For FAs and TMs, there is NO difference in recognition power between deterministic and nondeterministic versions.
- For PDAs, there IS a difference in recognition power between deterministic and nondeterministic versions.
- For LBAs, we don’t know!
Who knew right? 🤯
These classes form a hierarchy of language recognition power, with each language class being a proper subset of the next. Here’s another view of the table above, illustrating the containment relationships, with example languages in each region:
**There were new languages there!**
You may have noticed the languages $PRESBURGER$, $H$ and $\overline{H}$ in the diagram above. We haven’t seen them yet. But we will. For now, just note that it is pretty cool that we can identify languages in some of these interesting families.
In many cases, the class of languages accepted by a particular type of machine corresponds exactly to the class of languages generated by a type of grammar. And for those language classes that cannot be characterized by a grammar, there are no automata that characterize them either!
| Language Type | Description | Grammar | Recognizer |
|---|---|---|---|
| Finite | A language with only a finite number of strings. | $S \rightarrow w_1 \mid \ldots \mid w_n$ | Loop-free FA |
| Regular (Type 3) | A language in which strings can be recognized by processing them one symbol at a time, without backtracking, and without using additional memory. | RLG, LLG | FA |
| Deterministic Context-Free (LR) | A language recognizable on a deterministic PDA. (Stacks are nice! $\Theta(1)$ access! Determinism is nice too! Much more efficient.) | LR | DPDA |
| Context-Free (Type 2) | A language recognizable on a nondeterministic PDA. (Stacks are nice! $\Theta(1)$ access! But with nondeterminism, recognition is slower.) | CFG | PDA |
| Context-Sensitive (Type 1) | A language recognizable on an LBA. (The name “context-sensitive” is confusing because intuitively all languages “beyond” Type 1 in the hierarchy are sensitive to context.) | ENCG, CSG | LBA |
| Recursive (R) | A language in which string membership is decidable. There are no predetermined memory bounds, and no bounds on the number of steps. | (No name) | TM that always halts |
| Recursively Enumerable (Type 0 / RE) | A language in which string membership is partially decidable: if a string is in the language, you can determine that for sure; if a string is not, you might not be able to determine this. | Grammar | TM |
| co-RE | A language whose complement is RE. If a string is not in the language, you can determine that for sure; if a string is in the language, you might not be able to determine this. | Not computable | |
| Finitely Describable | A language that has a finite description. | Not computable | |
| All Languages ($\mathcal{P}(\Sigma^*)$) | All possible languages over the alphabet. | Not computable | |
Fun fact: $\textit{RE} \cap \textit{co-}RE = R$.

The language class RE lines up with what we understand to be computable.
Can any type of automata, or other computational model, compute a larger class of functions than a Turing Machine? We don’t think so. In fact, the following are all equal in computing power:
The fact that these all are essentially the same thing, and all attempts at making them more powerful don’t actually increase their power, led to the Church-Turing Thesis, which states:
The Turing Machine captures exactly what we intuitively think of as computable.
There are hundreds (no kidding) of other classes of languages. Perhaps the two most famous are:
Note, by definition, $\mathit{P} \subseteq \mathit{NP}$, but no one has been able to prove whether or not $\mathit{P} = \mathit{NP}$. There is a cash prize waiting for you if you can do it.
The language classes discovered and invented in the study of complexity theory can be mixed into the language classes we’ve seen already. Gabe Robins created this beautiful visualization:
There’s so much more, including languages tied to probabilistic and quantum computing. All of the language families described above, as well as hundreds more, can be found in the Complexity Zoo.
One of the reasons we have theories is to apply them. (Remember, a theory is an organized body of knowledge with explanatory and predictive powers.) What are some applications of automata?
Type-3 machines are used in Lexical Analysis (extracting “words” from a string of characters), an important operation in text processing, and especially in compiler writing.
Type-2 machines are used in Parsing, uncovering the structure of a string of symbols, another hugely important and common operation.
Automata Theory, like any theory, has its small building blocks. We aim for machines that are so small and so easy to understand that their operation can be fully grasped and analyzed. If the operations are instructions, and we have a way to represent these instructions, we have an interesting blurring between instructions and machines.
Brainfuck (or Brainf**k as it is commonly known) is a programming language with an explicit machine execution model featuring (1) a fixed program stored on a read-only tape where each cell holds one of eight instructions, (2) a read-only input stream of bytes (you can think of it as a tape), (3) a write-only output stream of bytes (also a tape), and (4) a data tape with a fixed left end and infinite to the right (although really the language specification says at least 30,000 cells). The eight instructions are:
| Instruction | Description |
|---|---|
| `+` | Increment the byte in the current data cell |
| `-` | Decrement the byte in the current data cell |
| `.` | Output the byte in the current data cell |
| `,` | Read next byte from input and store in current data cell |
| `<` | Move left on tape |
| `>` | Move right on tape |
| `[` | If byte in current data cell is 0, jump to instruction right after the next matching `]` |
| `]` | If byte in current data cell is not 0, jump to instruction right after the previous matching `[` |
Here’s a program in the language. Can you figure out what it does?
+++++++[>+++++[>++>+++<<-]<-]>>++.>.
```
7  5  2  3       the four constants appearing in the program
7  0 10 15       data cells after the inner loop finishes once
0  0 70 105      data cells after the outer loop finishes
72 105           the two bytes the program outputs
```
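If you’d rather not trace it by hand, here is a minimal interpreter sketch (fixed 30,000-cell data tape, byte cells with wraparound, and end-of-input reads as 0 — conventions the language specification leaves open):

```python
def brainfuck(code, inp=b""):
    """Minimal Brainfuck interpreter sketch."""
    match, stack = {}, []
    for i, c in enumerate(code):        # pre-match the brackets
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, p, pc, read = [0] * 30000, 0, 0, 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if   c == "+": tape[p] = (tape[p] + 1) % 256
        elif c == "-": tape[p] = (tape[p] - 1) % 256
        elif c == ">": p += 1
        elif c == "<": p -= 1
        elif c == ".": out.append(tape[p])
        elif c == ",":
            tape[p] = inp[read] if read < len(inp) else 0
            read += 1
        elif c == "[" and tape[p] == 0:
            pc = match[pc]              # lands just past the matching ]
        elif c == "]" and tape[p] != 0:
            pc = match[pc]              # lands just past the matching [
        pc += 1
    return bytes(out)
```

Running the program above through it reveals the answer.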
See the Language overview at Esolang, the Wikipedia article, and this collection of example programs.
$\mathcal{P''}$ is one of the earliest of the tiny languages. Its alphabet is only four symbols, each mapping to a specific instruction. It’s perhaps too small to easily study, but fascinating for its minimalism.
We’ll cover some of these in class.
Think about it: when you give the encoding of a Turing machine, aren’t you in some sense expressing a computation? Machine descriptions are communicated in a language, so they can be viewed as a program. We’ll provide only a simple example, which will be “run” in class:
```
SCANNING,0,0,R,SCANNING;
SCANNING,1,1,R,SCANNING;
SCANNING,_,_,L,FLIPPING;
FLIPPING,1,0,L,FLIPPING;
FLIPPING,0,1,N,DONE;
FLIPPING,_,1,N,DONE;
```
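One way to “run” such a description is to parse it into a rule table and simulate it. A sketch, assuming the start state is `SCANNING`, the blank is `_`, and `N` means no move:

```python
def parse_tm(text):
    """Parse the five-field encoding above: current state, symbol
    read, symbol written, move (L/R/N), next state."""
    rules = {}
    for chunk in text.split():
        q, read, write, move, q2 = chunk.rstrip(";").split(",")
        rules[(q, read)] = (write, {"L": -1, "R": 1, "N": 0}[move], q2)
    return rules

def run(rules, tape, state="SCANNING", blank="_"):
    """Run until no rule matches, then return the non-blank tape."""
    cells = dict(enumerate(tape))
    head = 0
    while (state, cells.get(head, blank)) in rules:
        write, move, state = rules[(state, cells.get(head, blank))]
        cells[head] = write
        head += move
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

MACHINE = """
SCANNING,0,0,R,SCANNING;
SCANNING,1,1,R,SCANNING;
SCANNING,_,_,L,FLIPPING;
FLIPPING,1,0,L,FLIPPING;
FLIPPING,0,1,N,DONE;
FLIPPING,_,1,N,DONE;
"""
```

Here the machine description really is just data: `run(parse_tm(MACHINE), "10111")` increments the number, exactly as the earlier trace did.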
We can extend the minimal models of computation by adding more and more features to make machines of arbitrary complexity, that are low-level enough to precisely define and implement in software and hardware. More about these later in the course.
Given automata for different operations, we can combine them in various ways to make automata for more complex operations. Here’s a simple example:

and a slightly larger one:

Check out the following sources for more information:
Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.
- `+`: increment current cell
- `-`: decrement current cell
- `,`: read
- `.`: write
- `<`: move pointer left
- `>`: move pointer right
- `[`: jump past matching `]` if current cell is 0
- `]`: jump back to matching `[` if current cell is nonzero

We’ve covered: