Introduction to regular expressions

October 8, 2015

Hey all! Today we’re going to cover a gentle introduction to regular expressions. We’ll use JavaScript as our example language, although almost everything we cover carries over to most modern languages.

Regular Expressions with JavaScript

JavaScript regular expressions come with a bunch of handy methods, but today we are going to focus on the string method match, which returns the matched text (or null when nothing matches) for the regex patterns we’ll be creating.

To play around with regular expressions interactively, we can use the console or a service like JS Bin and some code like this:

var string = "hello world"; // the text to search
var pattern = /hello/;      // the pattern to look for
console.log(string.match(pattern));

So, using that simple strategy, let’s get started!

String Literals

The very first thing we’ll learn how to do is match for a literal string. This means checking a string like “hello” and matching literally just the word “hello”. That looks something like this:

var string = "hello";
var pattern = /hello/;

console.log(string.match(pattern)); // returns ["hello"]

Similarly, searching for a string literal that does not exist will return null.

var string = "hello";
var pattern = /goodbye/;

console.log(string.match(pattern)); // returns null
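
Because match returns null when nothing is found, code that reads the result usually needs a guard before touching the array. Here is a minimal sketch of that idea; the helper name describe is made up for illustration and is not part of JavaScript.

```javascript
// Hypothetical helper: report the first match, or say there was none.
function describe(string, pattern) {
  const result = string.match(pattern);
  if (result === null) {
    return "no match";
  }
  // The full matched text is always the first entry.
  return "matched: " + result[0];
}

console.log(describe("hello", /hello/));   // matched: hello
console.log(describe("hello", /goodbye/)); // no match
```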

Either or

Sticking with string literals, let’s say we want to match if a string includes either one word or another. We can do this by grouping the string literals together and using a single pipe as an OR operator.

var string = "hello";
var pattern = /(goodbye|hello)/;

console.log(string.match(pattern)); // returns ["hello", "hello"]

Just like before, if the string doesn’t contain either of our string literals, it returns null.

var string = "foo";
var pattern = /(goodbye|hello)/;

console.log(string.match(pattern)); // returns null
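
The parentheses also decide how far the pipe reaches. As a small sketch (the words “gray” and “grey” are just sample data), compare an alternation limited to one letter with one that splits the whole pattern in two:

```javascript
const grouped = /gr(a|e)y/; // "gr", then "a" or "e", then "y"
const ungrouped = /gra|ey/; // either "gra" or "ey"

console.log("gray".match(grouped)[0]);   // "gray"
console.log("grey".match(grouped)[0]);   // "grey"
console.log("grey".match(ungrouped)[0]); // "ey"
```

Without the group, the pipe splits everything to its left from everything to its right, which is rarely what we want.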

Multiple results

Some of you may have noticed that while the or operator returned a match like we wanted, it also returned an Array with two entries.

This is because any time you group literals or patterns together using parentheses, you are also telling the regex engine to “remember” the match. If the group matches, its text is added as another entry in your array.

In order to tell the regex engine to “forget” the match, you can start the group with “?:” like so:

var string = "hello";
var pattern = /(?:goodbye|hello)/;

console.log(string.match(pattern)); // returns ["hello"]
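
Remembered groups are useful when we want the pieces of a match as well as the whole thing. A short sketch, using a made-up sample string, shows where each group lands in the result:

```javascript
const greeting = "hello world";
const pattern = /(hello|goodbye) (world|moon)/;

const result = greeting.match(pattern);

console.log(result[0]); // "hello world" (the full match)
console.log(result[1]); // "hello" (first group)
console.log(result[2]); // "world" (second group)
```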

Any Character

More often than not, we don’t know the exact word we’re looking for. Frequently we need to match on something less specific, like a letter or number.

Match a Letter

Let’s say we want to match on any single letter “a” through “z”. We can do a little something like this:

var string = "r";
var pattern = /[a-z]/;

console.log(string.match(pattern)); // returns ["r"]

But let’s see what happens if the letter is capitalized.

var string = "R";
var pattern = /[a-z]/;

console.log(string.match(pattern)); // returns null

Luckily, we have a great option for handling this. Using “flags” we gain a lot of flexibility with our regex searches. For this case, we can add an “i” flag which makes our search case insensitive.

var string = "R";
var pattern = /[a-z]/i;

console.log(string.match(pattern)); // returns ["R"]
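
The “i” flag is not limited to ranges. As a quick sketch, it works the same way on a string literal, and the match keeps whatever casing the input had:

```javascript
const pattern = /hello/i;

console.log("hello".match(pattern)[0]); // "hello"
console.log("HELLO".match(pattern)[0]); // "HELLO"
console.log("HeLLo".match(pattern)[0]); // "HeLLo"
```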

Match a Number

Sometimes, it’s a number we need to find instead of a letter. Very similar to the example above we can use a range like [0-9] to match a number instead of [a-z] for a letter.

var string = "7";
var pattern = /[0-9]/;

console.log(string.match(pattern)); // returns ["7"]

Match a Letter or Number

As you might imagine, sometimes we need to match on a single character whether it’s a number or a letter. Fortunately, regular expressions provide us with a shorthand for matching any alphanumeric character (it also matches the underscore). It looks like “\w”.

var string1 = "7";
var string2 = "r";
var pattern = /\w/;

console.log(string1.match(pattern)); // returns ["7"]
console.log(string2.match(pattern)); // returns ["r"]
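
It can help to see what the word-character shorthand \w does and does not accept. In JavaScript it covers letters, digits and the underscore, but not spaces or hyphens, as this small sketch shows:

```javascript
const pattern = /\w/;

console.log("_".match(pattern)[0]); // "_"
console.log("-".match(pattern));    // null
console.log(" ".match(pattern));    // null
```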

Before we get too much further, don’t worry about memorizing all of these character classes and flags like “\w” and “i”. They are all nicely documented over at the Mozilla Developer Network.

Match a number of characters

Any number of characters

The following techniques will work with a letter search “[a-z]”, a number search “[0-9]” or an alphanumeric search “\w”.

Let’s say we want to match any number of letters that are next to each other. We can use the “+” character to match one or more of the pattern before it. So to match one or more lowercase letters we can do:

var string = "baz";
var pattern = /[a-z]+/;

console.log(string.match(pattern)); // returns ["baz"]

We could apply the same logic to [0-9] or \w.
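
As a rough illustration with made-up sample strings, here is “+” applied to the digit range and to the word-character shorthand \w:

```javascript
// One or more digits in a row
console.log("order 66".match(/[0-9]+/)[0]); // "66"

// One or more word characters in a row (stops at the "!")
console.log("foo_bar42!".match(/\w+/)[0]);  // "foo_bar42"

// "+" needs at least one occurrence, so no digits means no match
console.log("abc".match(/[0-9]+/));         // null
```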

Specific number of characters

Other times we want to match only if there are a specific number of characters in a row. Let’s say we were looking for the string “www” to identify a web address. We could use a string literal like “/www/” but let’s check out a cleaner way of doing it.

We can match the letter “w” 3 times by using curly braces which allow you to pass in two numbers for minimum and maximum length or a single number for the exact number of occurrences. Let’s check out some examples to better understand.

var string1 = "www";
var string2 = "wwww";
var string3 = "wwwwww";

var pattern1 = /w{3}/;
var pattern2 = /w{2,5}/; // no space after the comma


// Matches exactly 3 w's
console.log(string1.match(pattern1)); // returns ["www"]

// Matches the first 3 w's
console.log(string2.match(pattern1)); // returns ["www"]

// Matches between 2 and 5 w's, taking as many as it can
console.log(string3.match(pattern2)); // returns ["wwwww"]
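
One detail worth checking when a curly-brace pattern unexpectedly returns null: the braces must not contain a space. In a JavaScript regex without the “u” flag, a space after the comma turns the whole thing into literal text. A small sketch of the difference:

```javascript
const strict = /w{2,5}/;  // quantifier: between 2 and 5 w's
const spaced = /w{2, 5}/; // literal text: "w{2, 5}"

console.log("wwwwww".match(strict)[0]);  // "wwwww"
console.log("wwwwww".match(spaced));     // null
console.log("w{2, 5}".match(spaced)[0]); // "w{2, 5}"
```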

Special Characters

If we search for “.” as a string literal, we’ll get false positives like this:

var string1 = "r";
var string2 = ".";

var pattern = /./;


console.log(string1.match(pattern)); // returns ["r"]
console.log(string2.match(pattern)); // returns ["."]

This is because some characters have a special meaning in regex, and we might also want to search for them literally. In this case, “.” is a wildcard, so if we want the literal character “.” we’ll need to escape it with a backslash.

var string1 = "r";
var string2 = ".";

var pattern = /\./;


console.log(string1.match(pattern)); // returns null
console.log(string2.match(pattern)); // returns ["."]
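
The same backslash escape applies to other special characters, such as the “+” we met earlier. A brief sketch with a made-up sample string:

```javascript
const sum = "1+1=2";

// Escaped: a literal "1", a literal "+", a literal "1"
console.log(sum.match(/1\+1/)[0]); // "1+1"

// Unescaped: one or more "1" followed by another "1", i.e. "11"
console.log(sum.match(/1+1/));     // null
```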

Additionally, it’s common that we need a pattern which includes a white space. Let’s say we’re looking for the words “web design”.

var string = "web design";

var pattern = /\w+/;


console.log(string.match(pattern)); // returns ["web"]

As we can see, our match stops as soon as it hits a space. For these situations we can use “\s” to account for a whitespace character.

var string = "web design";

var pattern = /\w+\s\w+/;

// This will match one or more word characters
// plus a whitespace character
// followed by one or more word characters
console.log(string.match(pattern)); // returns ["web design"]
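
The whitespace shorthand \s covers more than the space bar. As a small sketch, the same pattern also accepts a tab or a line break between the two words, but not a hyphen:

```javascript
const pattern = /\w+\s\w+/;

console.log("web\tdesign".match(pattern) !== null); // true (tab)
console.log("web\ndesign".match(pattern) !== null); // true (line break)
console.log("web-design".match(pattern));           // null
```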

Begins With

Frequently we find ourselves only wanting to match a pattern if the string in question begins that way. Let’s say for example we still want to match “www” but only if that’s the very beginning of the string.

Here we can use regular expression boundaries. There are four total boundaries that you can use but today we’re only going to focus on the beginning of an input and the end of an input.

To find a “www” only if it’s the beginning of the input we can use the “^” boundary for the beginning of an input.

var string1 = "www.google.com";
var string2 = "google.com/www";

var pattern = /^w{3}/;


console.log(string1.match(pattern)); // returns ["www"]
console.log(string2.match(pattern)); // returns null

Ends With

Very similarly, sometimes we need to match a pattern only if it’s at the end of an input string. For this example, let’s say we want to find strings that end with either “.com” or “.org”. Let’s combine the or operator from earlier with the ends with boundary “$”.

var string1 = "google.com";
var string2 = "google.org";
var string3 = "someorg.edu";

var pattern = /\.(?:com|org)$/;


console.log(string1.match(pattern)); // returns [".com"]
console.log(string2.match(pattern)); // returns [".org"]
console.log(string3.match(pattern)); // returns null
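
Using both boundaries in one pattern forces the whole input to fit, not just part of it. A minimal sketch, assuming we only want bare “www.something.com” addresses (the sample strings are made up):

```javascript
// Start of input, "www.", word characters, ".com", end of input
const pattern = /^www\.\w+\.com$/;

console.log("www.google.com".match(pattern)[0]);     // "www.google.com"
console.log("www.google.com/search".match(pattern)); // null
console.log("my.www.google.com".match(pattern));     // null
```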

Single Word

Let’s say now we want to match a string literal but only if it’s a word by itself. We can use the zero-width word boundary “\b” for this. Let’s check out some examples.

var string1 = "or";
var string2 = "poor";
var string3 = "organism";

var pattern = /\bor\b/;


console.log(string1.match(pattern)); // returns ["or"]
console.log(string2.match(pattern)); // returns null
console.log(string3.match(pattern)); // returns null
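
The word boundary \b is most useful inside longer text. As a quick sketch with made-up sentences, the same pattern finds a standalone “or” and ignores the ones buried in other words:

```javascript
const pattern = /\bor\b/;

const found = "tea or coffee".match(pattern);
console.log(found[0]);    // "or"
console.log(found.index); // 4

console.log("a poor organism".match(pattern)); // null
```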

Optional

Another frequent occurrence is searching for optional strings. Going back to our “www” example, we know that some websites use “www.” and some don’t. Let’s use the optional quantifier “?” together with the escaped literal dot “\.” to match websites whether they start with “www.” or not.

var string1 = "www.google.com";
var string2 = "google.com";

// optional www. followed by google.com literal
var pattern = /(?:www\.)?(?:google\.com)/;


console.log(string1.match(pattern)); // returns ["www.google.com"]
console.log(string2.match(pattern)); // returns ["google.com"]
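
The “?” also works on a single character, with no group needed. A small sketch using the two spellings of “color” as sample data:

```javascript
// The "u" may appear once or not at all
const pattern = /colou?r/;

console.log("color".match(pattern)[0]);  // "color"
console.log("colour".match(pattern)[0]); // "colour"
console.log("colouur".match(pattern));   // null
```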

The last thing we’ll cover today is how to handle a global search. Say we have a list of websites in a multiline string. Let’s combine a lot of what we’ve learned so far. We want a pattern that:

  • Appears anywhere in the multiline string
  • Is the beginning of a new line
  • Optionally starts with “www.”
  • Followed by any number of alphanumeric characters
  • Ends with a “.com”

var string = "www.google.com\nwww.facebook.com\ntwitter.com";

// g finds every match; m lets ^ and $ match at each line break
var pattern = /^(?:www\.)?\w+\.com$/gm;


console.log(string.match(pattern)); // ["www.google.com", "www.facebook.com", "twitter.com"]
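
To see what each flag contributes, here is a rough sketch that runs a similar pattern three ways. The variable names are made up, and the pattern leaves off the end-of-input boundary so that only the effect of “^” is on show.

```javascript
const list = "www.google.com\nwww.facebook.com\ntwitter.com";

const noFlags = /^(?:www\.)?\w+\.com/;
const globalOnly = /^(?:www\.)?\w+\.com/g;
const globalMultiline = /^(?:www\.)?\w+\.com/gm;

// No flags: stops after the first match
console.log(list.match(noFlags)[0]);     // "www.google.com"

// g alone: keeps searching, but ^ still only means "start of input"
console.log(list.match(globalOnly));     // one entry: "www.google.com"

// g and m: ^ now matches at the start of every line
console.log(list.match(globalMultiline)); // all three addresses
```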