Systems Stuff: regular expressions, or: how I learned to stop worrying and love our near-future robot overlords

the stuff of magic

regular expressions have always amazed me. if there's one thing that's magical about programming, it's regular expressions. they can literally pull apart the written language of humans in elegant ways.

you may have heard of them, but maybe not. put simply, a regular expression (aka regex or regexp) is a way of taking apart a string of text and finding and/or extracting exactly what you want. regular expressions are very, very powerful, and they can be very intimidating to look at.

a regular expression is technically its own programming language, or better put, it's a parsing language. it goes through the text you provide to it and it parses out whatever you specify.

across the board, the core functionality of regular expressions is the same from language to language, but there are a few idiosyncrasies here and there. for the purposes of this guide, everything I show you should work similarly in PHP, Perl, Javascript, Ruby, and Python. However, I'm going to use Javascript as the primary example language, since it treats regex as its own object.

regular expressions are built into most every programming language in one way or another, so learning them once will help you most anywhere you are.

what for

regular expressions have a huge range of uses. it's one of the most powerful tools in the ol' nerd utility belt. but what are some practical uses?

searching through text for a specific string
replacing a specific string with another string
counting the number of times a string is within a block of text
making sure a person submitted an actual email address into a form
normalizing a phone number whether it was written as 555.555.5555 or (555) 555-5555 or 555-555-5555

pretty much anytime you need to do something specific with text, you'll probably need a regular expression. regular expressions can turn complex convoluted functions into one single regular expression clause.

here's a very simple regular expression. remember, this is javascript, as an example.

var sometext = "Hello there!";
var regex = /there/;
if (regex.test(sometext)) {
  alert('oh boy!');
} else {
  alert('nope.');
}

so what are you seeing? first, we make a variable with just a string. that's fine. but then after that we have another variable -- and it's surrounded by /s?

traditionally a regular expression is contained within / characters. they denote the beginning and end of the expression. in some other languages (like PHP) the regex is written inside a string; in PHP it'll still have the starting / and the ending /, while Python's re module, for example, takes the bare pattern with no slashes.

right now, the regular expression is simply the word there. that's it. nothing special about it.

the if statement uses the RegExp object's test() method, which tests the regular expression on the provided string and returns true or false. So we gave it the first variable and said "does the regular expression match anything in here?"

Of course, the word there is indeed inside the string Hello there! so if you ran this in a browser, you'd see the oh boy! alert box.
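if you want to poke at test() without alert boxes, here's a minimal sketch that just logs the results. the variable names are made up for illustration, and console.log stands in for alert() so it also runs outside a browser.

```javascript
// test() answers a yes/no question: does the pattern occur anywhere in the string?
const greeting = "Hello there!";

const foundThere = /there/.test(greeting); // "there" is in the string
const foundWorld = /world/.test(greeting); // "world" is not

console.log(foundThere); // true
console.log(foundWorld); // false
```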

let's go deeper

so yeah, using just text inside a regular expression is fine, but that's not very interesting. you begin to unlock the true power of regex when you start using metacharacters, quantifiers, modifiers, and brackets.

we'll look at them one at a time! they're so exciting. let's start with something easy: modifiers. these affect the whole expression. let's try a different example than above:

var sometext = "Hello there!";
var regex = /hello/;
if (regex.test(sometext)) {
  alert('oh boy!');
} else {
  alert('nope.');
}

ok so all we changed was the regex from there to hello... if you run it, it won't work! Why? well, there is no hello in that string. There's a Hello... which has a capitalized H. so it's not the same. haha.

but say we don't care about case-sensitivity (case-sensitive means every character needs to be exactly-so) and we just want it to match either hello or Hello or even heLLO...

all you need to do is add the case insensitive modifier to the end of the expression:

var sometext = "Hello there!";
var regex = /hello/i;
if (regex.test(sometext)) {
  alert('oh boy!');
} else {
  alert('nope.');
}

haha, now it'll work. all we added, if you noticed, was the i to the end of the regular expression, outside of the expression's final /. that's where modifiers go!

the two other popular ones are g and m... g means global, meaning it'll match all of them as opposed to stopping at the first. i'll show you what i mean in a minute. m means multiline, which makes ^ and $ (more on those below) match at the start and end of every line in the text, instead of only at the very start and very end of the whole string.

here's another very basic regular expression example:

var sometext = "Haha well aren't we happy.";
var regex = /ha/ig;
var newtext = sometext.replace(regex, "lo");
alert(newtext);

so you can see, we have a string of text, then a regular expression, which simply looks for the text ha and it's case-insensitive and global, meaning it'll look for every instance of ha and it won't care about the case.

if you run that, you'll get back the text lolo well aren't we loppy. haha... good example. it used the regular expression to find what parts of the text it should replace with the new text "lo", get it?
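to see g and m a bit more directly, here's a rough sketch using the match() method instead of replace(). it also covers the "counting" use from the list earlier. the variable names are just for illustration.

```javascript
const laugh = "Haha well aren't we happy.";

// without g, match() stops at the first hit
const firstOnly = laugh.match(/ha/i);
console.log(firstOnly[0]); // "Ha"

// with g, match() returns every hit as an array
const allHits = laugh.match(/ha/gi);
console.log(allHits); // ["Ha", "ha", "ha"]

// counting occurrences; match() returns null when nothing matches,
// so fall back to an empty array before asking for the length
const hitCount = (laugh.match(/ha/gi) || []).length;
console.log(hitCount); // 3

// m makes ^ (and $) work per line instead of per string
const twoLines = "first line\nsecond line";
const withoutM = /^second/.test(twoLines); // false: "second" is not at the start of the string
const withM = /^second/m.test(twoLines);   // true: "second" is at the start of a line
console.log(withoutM, withM);
```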

so those are modifiers.

metacharacters

this is where things get crazy. if you read the PHP guide, you might remember me talking about escape characters, so that you can have a newline \n character inside a string (instead of having an actual new line). a metacharacter works much the same way.

metacharacters simply show what cannot be shown, or act as shortcuts to larger sets of text. for example, the \d metacharacter represents all digits. so if you just want to select all of the numbers in a string, you could easily use that to match them all.

most metacharacters start with a backslash and then another character (the last two in this list, ^ and $, stand on their own). here are some popular ones:

\w and \W represent any word character and any non-word character, respectively.
\d and \D represent any digit character and any non-digit character, respectively
\b represents the beginning or end of a word
\n represents a new line
\s and \S represent any whitespace character and any non-whitespace character, respectively
^ means the very beginning of the string
$ means the very end of the string!

so, for example, let's say we want to find the text "cat" inside a string:

var test_string_one = "cats cat caterpillar";
var regex_one = /cat/gi;

that regex, for that string, would match every instance of the text cat, no matter where it is. what if we wanted just the word "cat"?

var test_string_one = "cats cat caterpillar";
var regex_one = /\bcat\b/gi;

if we surround the text with the \b metacharacter, it'll make sure that it only matches the text if "cat" is its own word!
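here's a small sketch of the difference, using match() to list what each version actually finds. the variable names are only for illustration.

```javascript
const catText = "cats cat caterpillar";

// plain /cat/ finds the letters c-a-t wherever they appear
const anyCat = catText.match(/cat/gi);
console.log(anyCat); // ["cat", "cat", "cat"]

// with \b on both sides, only the standalone word counts
const wholeWord = catText.match(/\bcat\b/gi);
console.log(wholeWord); // ["cat"]

// search() reports where the standalone word starts
const wholeWordIndex = catText.search(/\bcat\b/i);
console.log(wholeWordIndex); // 5
```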

something that always scared me (haha, yes) was that the \b and the cat were touching... that sounds strange, but it seemed weird to me. i always read it like bcat? uhhh...

but take a moment to remember that this isn't for your eyes, this is for a machine to understand, and the parser will read the regular expression one character at a time. so don't be afraid of how strange the regular expression looks! it all looks comprehensible to our robot overlords...

another metacharacter example:

var sometext = " this has   a       lot of    whitespace     ";
var regex = /\s/g;
var newtext = sometext.replace(regex, "");

that string has a lot of blank whitespace. kinda ugly. how about we just kill all the whitespace! the regex, you'll notice, just has a single metacharacter, the whitespace one, and it's gonna look for it globally. with the replace() method we find all instances of whitespace and then replace them with nothingness!

kinda weird thing to do, but I haven't taught you quantifiers yet, so that's as basic as it can be. what if we want to replace those long sections of spaces with just a single space?

quantifiers

say we wanted to match not just text or a metacharacter, but a series of characters. let's use the above example again.

quantifiers simply specify how many times to look for the previous character... let me show you!

var sometext = " this has   a       lot of    whitespace     ";
var regex = /\s+/g;
var newtext = sometext.replace(regex, " ");

I changed two things: I added a + after the \s metacharacter, and I put a single space as the replacement string. what will this do? it'll find blocks of whitespace at least one character in length and replace them with a single space. CRAZY AWESOME.

as I said, the quantifier acts on whatever token (another word for character or set of characters) came before it. in this case, the \s token came before the plus sign, so the plus sign quantifier acts on that. it'll match one and then however many follow it.
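side by side, the two whitespace regexes behave roughly like this. it's a quick sketch; the trim() call at the end is my own addition to clean up the leftover edge spaces and isn't part of the examples above.

```javascript
const messy = " this has   a       lot of    whitespace     ";

// \s with g: every single whitespace character is removed
const noSpaces = messy.replace(/\s/g, "");
console.log(noSpaces); // "thishasalotofwhitespace"

// \s+ with g: each run of whitespace collapses to one space
const singleSpaces = messy.replace(/\s+/g, " ");
console.log(singleSpaces); // " this has a lot of whitespace "

// the runs at the very start and end became single spaces too; trim() drops them
const tidy = singleSpaces.trim();
console.log(tidy); // "this has a lot of whitespace"
```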

here are the basic quantifiers:

+ finds at least one of the preceding token
* finds zero or more of the preceding token
? finds one or none of the preceding token
{x} finds exactly x of the preceding token
{x,y} finds between x and y of the preceding token

here are some examples!

/hello+/i

will match:
hello
Hello
heLLo
HELLOOOOOOOO

because the + means at least one of the preceding token (which was just the character o), you can have as many as you want and it'll still match

/colou?r/i

will match:
color
colour

because we put a ? after the u, it'll match the string, whether the u is there or not! wow.

/hel{2,3}o/i

will match:
hello
helllo

but will not match:
helo
hellllo

because we asked it to match exactly two or three ls between the e and the o.
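if you'd rather check those than take my word for it, here's a little sketch that runs each pattern against a few strings. the array layout and names are just something I made up for the demo, and I threw in a couple of extra non-matching strings.

```javascript
// each entry: [pattern, text to try, whether we expect a match]
const quantifierChecks = [
  [/hello+/i, "HELLOOOOOOOO", true],
  [/hello+/i, "hell", false],      // + needs at least one o
  [/colou?r/i, "color", true],
  [/colou?r/i, "colour", true],
  [/colou?r/i, "colouur", false],  // ? allows one u at most
  [/hel{2,3}o/i, "hello", true],
  [/hel{2,3}o/i, "helllo", true],
  [/hel{2,3}o/i, "helo", false],
  [/hel{2,3}o/i, "hellllo", false],
];

const quantifierResults = quantifierChecks.map(([pattern, text]) => pattern.test(text));

quantifierChecks.forEach(([pattern, text, expected], i) => {
  console.log(String(pattern), text, quantifierResults[i], "(expected " + expected + ")");
});
```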

groups and brackets

so you may be asking, what if I want to use a quantifier on a group of characters? right now all you know is that it works on the preceding token!

that's where groups and brackets come in, because each one acts as a single token, the same as an individual character would.

so for an easy example, let's just show a group:

/l(ol)+/i

will match:
lol
lololololol
LOLOL

magical! so we enclose part of the text in parentheses and it'll act like a single token! so when you use the + quantifier, it's checking for everything within the group to be repeated.

brackets are similar, but a bit more abstract. I'll show you one:

/i feel [a-z]+/i

will match:
i feel good
i feel bad
i feel great

brackets are used to contain a range of possible characters. so in this example, it fits anything between a and z and the + quantifier after it says there can be one or more!

I should tell you now that groups inside parentheses also can act as sub matches in certain situations... for example:

/^i feel ([a-z]+)$/i

notice I denoted the beginning and end of the string with the ^ and $ metacharacters, just to make this airtight. that means that the string must start with i and must end after whatever the captured group is. I grouped the a to z range and its quantifier. when using certain functions, like PHP's preg_match() function or Javascript's match() method, this'll allow you to select not just the whole expression, but those sub-groups.

So if you ran PHP's preg_match() with that regex and gave it the string I feel great, it would capture the whole thing, but it would also capture just the word great. Very useful!
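in Javascript, the same idea looks roughly like this with the match() method. this is a sketch; the variable names are only for illustration.

```javascript
const feelPattern = /^i feel ([a-z]+)$/i;

// match() without the g modifier returns the whole match first,
// then each captured group in order
const feeling = "I feel great".match(feelPattern);
console.log(feeling[0]); // "I feel great"
console.log(feeling[1]); // "great"

// the $ means the string has to end right after that one word,
// so extra text makes the whole thing fail and match() returns null
const noFeeling = "I feel great today".match(feelPattern);
console.log(noFeeling); // null
```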

and you can put even more stuff within brackets together. for instance:

/i feel [a-z0-9_!]+/i

that'll match anything between a and z as well as anything between 0 and 9 and also any underscores or exclamation points!

you should remember also that you could shorten what's in the brackets by replacing the need for the a-z and 0-9 ranges with just the \w metacharacter, like so:

/i feel [\w_!]+/i

will match:
i feel great!
i feel fine
i feel __okay__

that does the same thing! remember, most metacharacters are shortcuts. the \w represents anything a-z, A-Z, 0-9, and the underscore too (so the extra _ in those brackets is harmless but not actually needed). magical!
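a quick way to see exactly what \w covers is to test a few single characters against it. this is just a sketch with made-up variable names, and I added ^ and $ to the last two patterns so the whole string has to fit.

```javascript
// letters, digits, and the underscore all count as word characters
const wordChars = "aZ9_".split("").map((ch) => /\w/.test(ch));
console.log(wordChars); // [true, true, true, true]

// punctuation and spaces do not
const nonWordChars = "!- ".split("").map((ch) => /\w/.test(ch));
console.log(nonWordChars); // [false, false, false]

// so listing _ next to \w inside brackets changes nothing
const withUnderscoreListed = /^i feel [\w_!]+$/i.test("i feel __okay__!");
const withoutUnderscoreListed = /^i feel [\w!]+$/i.test("i feel __okay__!");
console.log(withUnderscoreListed, withoutUnderscoreListed); // true true
```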

within groups you can also use the | pipe character to separate possibilities. so if you want to match one of three colors, you'd do it like this:

/i like (red|green|blue)/i

will match:
i like red
i like blue

will NOT match:
i like yellow

haha, so if you're looking for something within a specific group of possible values, you can do it that way within the regex itself.
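here's a rough sketch of that in action. the last two checks are my own addition: since the pattern has nothing after the group, it will happily match the start of a longer word, and a \b after the group is one way to tighten that up.

```javascript
const colorPattern = /i like (red|green|blue)/i;

// the group also captures which option was found
const likesBlue = "I like blue".match(colorPattern);
console.log(likesBlue[1]); // "blue"

const likesYellow = colorPattern.test("i like yellow");
console.log(likesYellow); // false

// nothing stops the match from ending in the middle of a word...
const likesRedwood = colorPattern.test("i like redwood");
console.log(likesRedwood); // true

// ...unless a word boundary is required after the group
const strictColorPattern = /i like (red|green|blue)\b/i;
const strictRedwood = strictColorPattern.test("i like redwood");
console.log(strictRedwood); // false
```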

a much more complicated (but very practical) example

writing regular expressions usually requires me to think like a computer. it really makes you break down language itself into its smallest parts.

here's a common regex:

/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

Oh jeez look at how complicated it is! What does that do? I'll just tell you: it matches most everyday email addresses! Let's break it apart...

obviously the first and last characters, ^ and $, signify the beginning and end of the string. And it has the i case-insensitive modifier for the whole thing.

so let's take those out. now let's focus on the first part:

[A-Z0-9._%+-]+

that's not so bad: it's just a range of possible characters, specifically A-Z and 0-9, and then periods and underscores and whatnot. Notice that the plus sign inside the brackets does not act as a quantifier, because within brackets it's just a literal + character; the + right after the closing bracket is the actual quantifier. However, if you were searching for a + sign outside of brackets, you'd need to escape it with a \ first.
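a tiny sketch of the plus sign's two personalities, using a made-up "1+1" string:

```javascript
// inside brackets, + is just the literal character
const plusInBrackets = /[+]/.test("1+1");
console.log(plusInBrackets); // true

// outside brackets, escaping it with \ makes it literal too
const escapedPlus = /1\+1/.test("1+1");
console.log(escapedPlus); // true

// unescaped, it's a quantifier: "one or more 1s, then another 1"
const unescapedOnPlusText = /1+1/.test("1+1");
const unescapedOnOnes = /1+1/.test("111");
console.log(unescapedOnPlusText); // false
console.log(unescapedOnOnes);     // true
```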

@

next we have just the @ symbol, because every email address has that!

[A-Z0-9.-]+

and after the @ an email address always has a domain name! And that domain name can only contain alphanumeric text, a period, or a hyphen. Nothing too special here. But this only covers the beginning of the domain name, so if it was google.com, this part would only cover the google part.

\.

this may seem a little puzzling. in a regular expression, the period . is a metacharacter which represents any character, so in order to use a period as just a period, we need to escape it by putting a \ before it! okay so now we got the period in the domain name, moving on.
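to see the difference between the two kinds of period, here's a minimal sketch with some throwaway strings:

```javascript
// an unescaped . stands for any single character
const anyCharDot = /a.c/.test("abc");
console.log(anyCharDot); // true

// an escaped \. only matches an actual period
const escapedDotOnAbc = /a\.c/.test("abc");
const escapedDotOnLiteral = /a\.c/.test("a.c");
console.log(escapedDotOnAbc);     // false
console.log(escapedDotOnLiteral); // true
```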

[A-Z]{2,4}

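so what's between two and four characters and can only be letters? com, net, org, edu... the end of the domain name! (some newer endings, like .museum or .technology, are longer than four letters, so you'd need to raise that upper limit if you want to allow them.)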

here's the full regex again:

/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i

now take your own email address and try to follow along to see how they match. it's magical.

here's the whole thing matched out more obviously:

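/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i against cyle_gage@emerson.edu

[A-Z0-9._%+-]+ matches cyle_gage
@ matches @
[A-Z0-9.-]+ matches emerson
\. matches .
[A-Z]{2,4} matches edu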
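and here's a rough sketch that runs the email regex against a few addresses. apart from the one above, the sample addresses are made up for illustration, and the last one shows where the {2,4} limit turns away a longer domain ending.

```javascript
const emailPattern = /^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i;

const emailSamples = [
  "cyle_gage@emerson.edu",             // the example from above
  "first.last+tag@mail.example.co.uk", // dots, a plus sign, and subdomains are all allowed
  "no-at-sign.example.com",            // no @ at all
  "someone@example.technology",        // ending is longer than four letters
];

const emailResults = emailSamples.map((address) => emailPattern.test(address));

emailSamples.forEach((address, i) => {
  console.log(address, emailResults[i]);
});
// emailResults is [true, true, false, false]
```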

the conclusions, the secrets

there are two main challenges to writing good regular expressions:

being specific in knowing what you want
allowing possibilities

you have to be as specific as the results you're expecting, yet you have to be broad enough to allow a range of possibilities. once you get the hang of it, though, it's pretty simple.
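as one small illustration of that balance, here's how the phone number idea from the start of this guide might be sketched. normalizePhone is a hypothetical helper I made up, and it assumes ten-digit numbers like the examples earlier. it's broad about punctuation but strict about how many digits it ends up with.

```javascript
// hypothetical helper: returns "555-555-5555" style text, or null if the input doesn't fit
function normalizePhone(input) {
  // be broad: throw away anything that isn't a digit (dots, dashes, spaces, parentheses)
  const digits = input.replace(/\D/g, "");

  // be specific: exactly ten digits, captured as groups of 3, 3, and 4
  const parts = digits.match(/^(\d{3})(\d{3})(\d{4})$/);
  if (!parts) {
    return null;
  }
  return parts[1] + "-" + parts[2] + "-" + parts[3];
}

console.log(normalizePhone("555.555.5555"));   // "555-555-5555"
console.log(normalizePhone("(555) 555-5555")); // "555-555-5555"
console.log(normalizePhone("555-555-5555"));   // "555-555-5555"
console.log(normalizePhone("555-5555"));       // null (only seven digits)
```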

I highly recommend this online tool to test out your regular expressions. It's flash-based so it has a couple quirks, but it's very thorough, and shows you results as you type.

there's a lot more you can do with regular expressions, but these are the most useful basics. I don't really stray from what I've taught you here very often, and they've gotten me some great results.

and now you can wear this shirt and really nerd yourself out!