regex.png

Regular Expressions

Regular Expressions are way cool. Knowledge of regexes will allow you to save the day.

Definitions

In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.), is a string that represents a regular (type-3) language.

Huh??

Okay, in many programming languages, a regular expression is a pattern that matches strings or pieces of strings. (Incidentally, the set of strings they are capable of matching goes way beyond what regular expressions from language theory can describe.)

Basic Examples

Rather than start with technical details, we’ll start with a bunch of examples.

Regex Matches any string that
hello contains {hello}
gray|grey contains {gray, grey}
gr(a|e)y contains {gray, grey}
gr[ae]y contains {gray, grey}
b[aeiou]bble contains {babble, bebble, bibble, bobble, bubble}
[b-chm-pP]at|ot contains {bat, cat, hat, mat, nat, oat, pat, Pat, ot}
colou?r contains {color, colour}
rege(x(es)?|xps?) contains {regex, regexes, regexp, regexps}
go*gle contains {ggle, gogle, google, gooogle, goooogle, ...}
go+gle contains {gogle, google, gooogle, goooogle, ...}
g(oog)+le contains {google, googoogle, googoogoogle, googoogoogoogle, ...}
z{3} contains {zzz}
z{3,6} contains {zzz, zzzz, zzzzz, zzzzzz}
z{3,} contains {zzz, zzzz, zzzzz, ...}
[Bb]rainf\*\*k contains {Brainf**k, brainf**k}
\d contains {0,1,2,3,4,5,6,7,8,9}
\d{5}(-\d{4})? contains a United States zip code
1\d{10} contains an 11-digit string starting with a 1
[2-9]|[12]\d|3[0-6] contains an integer in the range 2..36 inclusive
Hello\nworld contains Hello followed by a newline followed by world
b..b contains a four-character (sub)string beginning and ending with a b (Note: depending on context, the dot stands either for “any character at all” or “any character except a newline”)
\d+(\.\d\d)? contains a positive integer or a floating point number with exactly two characters after the decimal point.
sh[^io]t contains sh followed by any character other than an i or o, followed by t
//[^\r\n]*[\r\n] contains a Java or C# slash-slash comment
^dog begins with "dog"
dog$ ends with "dog"
^dog$ is exactly "dog"

Notation

There are many different syntaxes for regular expressions, but in general you will see that:

Using Regular Expressions

Many languages allow programmers to define regexes and then use them to:

Generally a regex is first compiled into some internal form that can be used for super fast validation, extraction, and replacing. Sometimes there is an explicit compile function or method, and sometimes special syntax is used to compile, such as the very common form /.../.

Validation

Example: find "color" or "colour" in a given string.

// Java
Pattern p = Pattern.compile("colou?r");
Matcher m = p.matcher("The color green");
m.find();                           // returns true
m.start();                          // returns 4
m.end();                            // returns 9
m = p.matcher("abc");
m.find();                           // returns false
# Perl
$p = /colou?r/;
"The color green" =~ $p;            # returns 1 (cuz no Perl true)
"abc" =~ $p;                        # returns 0 (cuz no Perl false)
# Ruby
p = /colou?r/
"The color green" =~ p              # returns 4
"abc" =~ p                          # returns nil
# Python
p = re.compile("colou?r")
m = p.search("The color green")
m.start()                           # returns 4
m = p.search("abc")                 # returns None
// JavaScript
const p = /colou?r/;
"The color green".search(p);        // returns 4
"abc".search(p);                    // returns -1

If you want to know if an entire string matches a pattern, define the pattern with ^ and $, or with \A and \Z. In Java, you can call matches() instead of find().

Extraction

After doing a match against a pattern, most regex engines will return you a bundle of information, including such things as:

Example in Ruby:

>> pattern = /weighs (\d+(\.\d+)?) (\w+)/
=> /weighs (\d+(\.\d+)?) (\w+)/
>> pattern =~ 'The thing weighs 2.5 kilograms or so.'
=> 10
>> $&
=> "weighs 2.5 kilograms"
>> $1
=> "2.5"
>> $2
=> ".5"
>> $3
=> "kilograms"
>> $`
=> "The thing "
>> $'
=> " or so."

The same thing in JavaScript:

> const pattern = /weighs (\d+(\.\d+)?) (\w+)/
> pattern.exec('The thing weighs 2.5 kilograms or so.')
[ 'weighs 2.5 kilograms',
  '2.5',
  '.5',
  'kilograms',
  index: 10,
  input: 'The thing weighs 2.5 kilograms or so.' ]

Note how in JavaScript, the match result object looks like an array and an object.

The so-called group numbers are found by counting the left-parentheses in the pattern:

TODO PICTURE GOES HERE

Sometimes you need parentheses only for precedence purposes and you don’t want to incur the cost of extracting a group. We have non-capturing groups for this purpose.

Ruby:

>> phone = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
=> /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/
>> phone =~ 'Call 555-1212 for info'
=> 5
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["Call ", "555-1212", " for info", nil, nil, "555", "1212", nil]
>> phone =~ '800.221.9989'
=> 0
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["", "800.221.9989", "", "800.", "800", "221", "9989", nil]
>> phone =~ '1800.221.9989'
=> 1
>> [$`, $&, $', $1, $2, $3, $4, $5]
=> ["1", "800.221.9989", "", "800.", "800", "221", "9989", nil]

JavaScript:

> const r = /((\d{3})(?:\.|-))?(\d{3})(?:\.|-)(\d{4})/g;
> const m = r.exec("Call 1.800.555-1212 for info");
> m.index
7
> JSON.stringify(m);
["800.555-1212","800.","800","555","1212"]

Java

Pattern phone = Pattern.compile("((\\d{3})(?:\\.|-))?(\\d{3})(?:\\.|-)(\\d{4})");
String[] tests = {"Call 555-1212 for info", "800.221.9989", "1800.221.9989"};
for (String s : tests) {
    Matcher m = phone.matcher(s);
    m.find();
    System.out.println("groupCount = " + m.groupCount());
    System.out.println("group(0) = " + m.group(0));
    System.out.println("group(1) = " + m.group(1));
    System.out.println("group(2) = " + m.group(2));
    System.out.println("group(3) = " + m.group(3));
    System.out.println("group(4) = " + m.group(4));
}
groupCount = 4
group(0) = 555-1212
group(1) = null
group(2) = null
group(3) = 555
group(4) = 1212
groupCount = 4
group(0) = 800.221.9989
group(1) = 800.
group(2) = 800
group(3) = 221
group(4) = 9989
groupCount = 4
group(0) = 800.221.9989
group(1) = 800.
group(2) = 800
group(3) = 221
group(4) = 9989

Substitution

Many languages have replace or replaceAll methods that replace the parts of a string that match a regex. Sometimes you will see a g flag on a regex instead of a replaceAll function.

alert("Rascally Rabbit".replace(/[RrLl]/g, "w"));

Components of Regexes

Character Classes

Groups

Defined above, in the section on extraction.

Quantifiers

Generally, 18 types:

EagerReluctantPossessive
Zero or one????+
Zero or more**?*+
One or more++?++
m times{m}{m}?{m}+
At least m times{m,}{m,}?{m,}+
At least m, at most n times{m,n}{m,n}?{m,n}+

Eager (Greedy and Generous) — match as much as possible, but give back

\w+\d\d\w+  // matches abcdef42skjhfskjfhsjdfs
            // but inefficiently

Possessive — match as much as possible, but do NOT give back

\w++\d\d\w+  // does not match abcdef42skjhfskjfhsjdfs
             // but is efficient

Reluctant — match as little as possible

\w+?\d\d\w+  // matches abcdef42skjhfskjfhsjdfs
             // efficiently, yay!

Backreferences

Things captured can be used later:

<(\w+)>[^<]*</\1>

Anchors, Boundaries, Delimiters

Some regex tokens do not consume characters! They just assert the matching engine is at a particular place, so to speak.

Read more about these at Rexegg.

Also, the lookarounds (up next!) don’t consume any characters either!

Lookarounds

Note: Read this awesome article on lookarounds.

Modifiers

A modifier affects the way the rest of the regex is interpreted. Not every language supports all of the modifiers below. For example, JavaScript (officially) supports only i, g, and m.

ModifierMeaning
gglobal
iignore case
mmultiple line
ssingle line (DOTALL): Means that the dot matches any character at all. Without this modifier, the dot matches any character except a newline.
xignore whitespace in the pattern
dUnix line mode: Considers only U+000A as a line separator, rather than U+000D or the U+000D/U+000A combo or even U+2028.
uUnicode case: in this mode the case-insensitive modifier respects Unicode cases; outside of this mode that modifier only consolidates cases of US-ASCII characters.

Performance Pitfalls

You should know some things about how your regex engine works since two "equivalent" regexes can have drastic differences in processing speed.

Performance Tips

Always do the following:

Miscellaneous Language-Specific Notes

A few things that are good to know:

Study and Practice

Here are some good sources: