Demystifying REGEX

A regular expression is simply a sequence of characters that defines a search pattern. Regular expressions can be given to languages like JavaScript to search a string for matches to the pattern.

This article is an introduction to regular expressions and some of their practical uses. Links to additional materials will be provided to supplement what you learn here.

Here is an example of a regular expression being saved in a variable named pattern.

const pattern = /vschool/i;

Let's break this down to understand what it's saying:

  • 1: /vschool/i : This is the regular expression.
  • 2: vschool : This is the pattern we are looking for when we use this regular expression.
  • 3: i : This is a modifier, which gives specific instructions to the search pattern. In this instance, i is saying that our search is not case sensitive.

This is the basic structure when making a regular expression in JavaScript. The pattern is enclosed in a pair of / characters, and anything after the closing / is a modifier we want to add to the search pattern.
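
To make the two parts concrete, here is a small sketch that reads them back off a pattern. It relies on the standard source and flags properties of JavaScript regular expression objects, which this article does not otherwise cover; the variable names are only for illustration.

```javascript
// The same pattern as above: "vschool" with the case-insensitive modifier.
const pattern = /vschool/i;

// .source holds the text between the slashes.
const patternText = pattern.source;

// .flags holds the modifiers that come after the closing slash.
const modifiers = pattern.flags;

console.log(patternText); // => "vschool"
console.log(modifiers);   // => "i"
```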

In order to use a regular expression, you will need to call a method that accepts a regular expression as an argument. JavaScript provides a few different methods for this. We will be using .replace() in the next few examples.

.replace() takes two arguments:

  • .replace(pattern, replaceStr)

The first argument is the regular expression (pattern) telling replace what to look for, and the second argument is what should replace any found matches. Here is an example of replace in action.

// Create the pattern we want .replace() to look for.
const pattern = /okay/;

// Here is the string we will call .replace() on to have replace search for our pattern.
let str = "V School is okay";

// Call replace on the original string and save the result in a new string.
let newStr = str.replace(pattern, 'awesome!');

console.log(newStr)
// =>  "V School is awesome!"

Pretty cool, right? You can also type your regular expression directly in the first argument like this:

let newStr = str.replace(/okay/, 'awesome!');

With that example hopefully you can already see the power of a regular expression. If you don't, try writing a for loop function that does this and you will quickly realize how much work it's doing for you.
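
For comparison, here is a rough sketch of what replacing the first occurrence of a word might look like with a plain for loop. The function name replaceFirst is made up for this illustration, and it only handles literal text, not patterns.

```javascript
// Hypothetical helper: replace the first occurrence of `target` in `str`
// without using a regular expression or .replace().
function replaceFirst(str, target, replacement) {
    for (let i = 0; i <= str.length - target.length; i++) {
        // Compare the slice starting at i with the text we are looking for.
        if (str.slice(i, i + target.length) === target) {
            return str.slice(0, i) + replacement + str.slice(i + target.length);
        }
    }
    // No occurrence found, so return the string unchanged.
    return str;
}

const str = "V School is okay";

console.log(replaceFirst(str, "okay", "awesome!"));
// => "V School is awesome!"

console.log(str.replace(/okay/, "awesome!"));
// => "V School is awesome!"
```

The regular expression version does the same job in a single line, and it keeps working when the thing you are searching for is a pattern rather than fixed text.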


Quantifiers

Quantifiers are used to tell our search pattern:

  • Search the string to see if it begins with a certain pattern
  • Search the string to see if it ends with a specific pattern
  • Search the string for a specific sequence.

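The three we will cover here are ^, $ and .* . Strictly speaking, ^ and $ are usually called anchors, since they mark a position in the string rather than a quantity, but they are often listed alongside the quantifiers. To see others available, visit this W3schools page.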

The ^ symbol, when used at the start of a pattern, tells the pattern to look for a string starting with whatever comes after the ^, like this:

const str = "hello world";

const result = str.replace(/^hello/, "what's up");
console.log(result);
// => "what's up world"

In this example, we are telling our .replace() method to search the given string, and if the string begins with 'hello', replace it with "what's up". While using /hello/ would look for the word 'hello' anywhere in the string, the /^hello/ tells the expression to look specifically at the beginning of the string for this pattern.

Now let's look at how to tell your pattern to look and see if the string ends with a specific word using $.

const str = "hello world"

const result = str.replace(/world$/, 'universe');
console.log(result)
// => "hello universe"

Just like in the example with the ^, we are telling .replace() to search the string to see if it ends with 'world'. If it does, we tell it to replace it with the word 'universe'. So to compare again, an expression like /world/ would look for that word anywhere in the string, but /world$/ tells the expression to look specifically at the end of the given string for this pattern.
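
A quick way to see the difference is to run both kinds of pattern against a string where the words sit in the middle rather than at the start or end. This is only an illustrative sketch; the sample string is made up.

```javascript
// 'hello' and 'world' both appear here, but neither is at the start or end.
const str = "oh hello world again";

// Without ^ or $, the words are found wherever they sit.
console.log(str.replace(/hello/, "hi"));       // => "oh hi world again"
console.log(str.replace(/world/, "universe")); // => "oh hello universe again"

// With ^ and $, nothing matches, so the string comes back unchanged.
console.log(str.replace(/^hello/, "hi"));       // => "oh hello world again"
console.log(str.replace(/world$/, "universe")); // => "oh hello world again"
```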

Lastly, .* (period star) combines the . character, which stands for any single character, with the * quantifier, which means "zero or more of whatever comes before it". Used between ^ and $, it lets you set both a beginning and an ending search parameter. In this example, we are looking to see if the string we are given begins with 'hello' and ends with 'bye'.

const str = "hello and goodbye";
const result = str.match(/^hello.*bye$/g);

console.log(result);
// => [ 'hello and goodbye' ]
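
To see what this kind of pattern accepts and rejects, the sketch below runs it against a few made-up strings. It drops the g modifier, since ^ and $ already limit the pattern to at most one match per string.

```javascript
const pattern = /^hello.*bye$/;

const samples = [
    "hello and goodbye",  // starts with 'hello', ends with 'bye'
    "hellobye",           // .* is allowed to match zero characters
    "hello and goodbye!", // ends with '!' instead of 'bye'
    "oh hello, bye",      // does not start with 'hello'
];

for (const sample of samples) {
    // .match() returns null when the pattern does not match.
    const result = sample.match(pattern);
    console.log(sample, "=>", result === null ? "no match" : "match");
}
// hello and goodbye => match
// hellobye => match
// hello and goodbye! => no match
// oh hello, bye => no match
```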

Modifiers

So far we have seen the i modifier (case insensitive), which tells the regEx function to find matches of the pattern whether they are uppercase or lowercase. However, regEx functions by default will only look for the first match. If you want the regEx function to search for as many matches as exist in the string, use the g (global) modifier. As an example:

const myStr = "Vschool is cooler than cool!";

// Without the /g modifier:
const result = myStr.replace(/ool/, '00L')
console.log(result) 

// => "Vsch00L is cooler than cool!";


// With the /g modifier:
const result2 = myStr.replace(/ool/g, '00L')
console.log(result2)

// => "Vsch00L is c00Ler than c00L!"

Multiple modifiers can exist in the same regular expression. Let's change our string a little to see how this would work using our g & i modifiers together:

// Our previous string modified with mixed uppercase and lowercase.
const myStr = "VschoOl is cOOLer than cOoL!";

const result = myStr.replace(/ool/gi, "00L");
console.log(result);
// => "Vsch00L is c00Ler than c00L!"

As you can see, this regular expression found all matches to 'ool' regardless of case, and replaced every place where it occurred in the string.


Methods

So far we have only used the .replace() method, so here we will go over a few other methods that are used with regular expressions.

.match()

Another method that accepts a regular expression as a parameter is the .match() method. .match() takes a single argument, which is the regEx pattern you want to search for in a string. If .match() finds any matches, it will return an array of them. If no match is found, .match() will return null.

.match() will return a different type of array depending on whether you tell it to look globally with the g modifier, or if you just tell it to find the first match. If you do not include the g modifier, .match() will return an array object containing some information about our match search. Let's see an example of this:

const str = "wubba lubba dub dub!";
const result = str.match(/ub/);

console.log(result);
// => ["ub", index: 1, input: "wubba lubba dub dub!", groups: undefined]

As you can see, .match() returned an array object where

  • result[0] is the first found instance of the regExp pattern we gave.
  • result.index is the index where the match first occurred.
  • result.input is the input string we gave to .match().
  • result.groups holds any named capture groups in the pattern, and is undefined when there are none.

This is useful for finding the first match of a pattern in a string, as you then have information like result.index that shows where it occurred. Now let's use the same example as before, but this time include the g modifier.
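
As a small sketch of how that match information might be used, the snippet below (which goes back to the pattern without the g modifier) reads the matched text and its position from the result, then uses the position to split the string around the match. The variable names are chosen only for this example.

```javascript
const str = "wubba lubba dub dub!";
const result = str.match(/ub/);

// The matched text is the first element of the array.
console.log(result[0]);    // => "ub"

// The position and the original string are properties on the array.
console.log(result.index); // => 1
console.log(result.input); // => "wubba lubba dub dub!"

// Use the position to grab the text before and after the match.
const before = str.slice(0, result.index);
const after = str.slice(result.index + result[0].length);
console.log(before); // => "w"
console.log(after);  // => "ba lubba dub dub!"
```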

const str = "wubba lubba dub dub!";
const result = str.match(/ub/g); // <- including /g modifier

console.log(result);
// => [ 'ub', 'ub', 'ub', 'ub' ]

You can see that if you give .match() the g modifier, it will return an array of all found matches.
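
Because .match() returns null rather than an empty array when nothing is found, code that counts matches usually needs a guard. Here is one way that might look; countMatches is a hypothetical helper name, not something built into JavaScript, and it assumes the pattern it receives has the g modifier.

```javascript
// Hypothetical helper: count how many times a global (g) pattern matches.
function countMatches(str, pattern) {
    // .match() returns null when there are no matches, so fall back to [].
    const matches = str.match(pattern) || [];
    return matches.length;
}

console.log(countMatches("wubba lubba dub dub!", /ub/g));  // => 4
console.log(countMatches("wubba lubba dub dub!", /xyz/g)); // => 0
```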

.exec()

The .exec() regEx method is very similar to the .match() method in that it returns an array object describing a match. When a match is found, that array object looks exactly the same as the one we saw with .match() when the g modifier was left off. The syntax is the big difference: you call the .exec() method on the pattern you want it to search for, and the string to search is put in between the ().

const str = "wubba lubba dub dub!";
const pattern = /wu/g;

const result = pattern.exec(str);
console.log(result)
// => ["wu", index: 0, input: "wubba lubba dub dub!", groups: undefined]

You'll notice that .exec() returned only a single match even though the pattern has the g modifier. That is because each call to .exec() returns one match at a time. If no match is found, .exec() returns null.
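
One detail worth knowing: when the pattern has the g modifier, the pattern object keeps track of where the last match ended in its lastIndex property, so calling .exec() again on the same pattern continues from there. A minimal sketch of looping over every match this way, using a different pattern than the example above so there is more than one match to find:

```javascript
const str = "wubba lubba dub dub!";
const pattern = /ub/g;

const positions = [];
let match;

// Each call returns the next match, or null once there are no more.
while ((match = pattern.exec(str)) !== null) {
    positions.push(match.index);
}

console.log(positions);
// => [ 1, 7, 13, 17 ]
```

Without the g modifier this loop would never end, because every call would start over from the beginning of the string and find the same first match.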

.test()

The last method we'll cover is the .test() method. .test() has the exact same syntax as the .exec() method:

pattern.test(str)

The big difference is that .test() returns a boolean value telling you whether it found a match or not.

const str = "wubba lubba dub dub!";
const pattern = /wu/;

const result = pattern.test(str);
console.log(result)
// => true
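
Because it returns a boolean, .test() fits naturally inside an if statement. The sketch below wraps the idea in a made-up helper, mentionsVSchool, using the case-insensitive pattern from the start of this article.

```javascript
// Hypothetical helper: does the text mention "vschool" in any casing?
function mentionsVSchool(text) {
    return /vschool/i.test(text);
}

if (mentionsVSchool("I study at VSchool")) {
    console.log("Found it!");
}
// => "Found it!"

console.log(mentionsVSchool("I study elsewhere"));
// => false
```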

[Brackets]

When you use [] in a regular expression, you are giving your pattern a specific set of characters to look for. Here is a simple example:

const str = "wubba lubba dub dub";

const result = str.match(/[abc]/g);
console.log(result)
// => [ 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'b' ]

As you can see, this had our match method look for all instances of the letters a, b, and c. The real power of [] comes when you include a - (dash) to give the expression a range to look for.

const str = "wubba lubba dub dub";

const result = str.match(/[a-c]/g);
console.log(result)
// => [ 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'b' ]

You'll notice we got the exact same output as before. The difference is that we used the - to tell the expression to look for any letters from a to c. When using ranges, be sure to include the g modifier so that the function does not stop after it finds the first match.

Here is a short list of ranges you can use that help show the usefulness of this:

  • [a-z]: Looks for matches from lowercase a to lowercase z.
  • [A-Z]: Looks for matches from uppercase A to uppercase Z. (To match both cases you could also just use the i modifier, and it would look like this: /[a-z]/gi.)
  • [0-9]: Looks for numerical characters in your string.
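
More than one range can also sit inside a single pair of brackets. As a sketch, the snippet below pulls the lowercase letters, uppercase letters, and digits out of a made-up string, and then uses all three ranges together.

```javascript
const str = "Room 42B, floor 3";

console.log(str.match(/[a-z]/g));
// => [ 'o', 'o', 'm', 'f', 'l', 'o', 'o', 'r' ]

console.log(str.match(/[A-Z]/g));
// => [ 'R', 'B' ]

console.log(str.match(/[0-9]/g));
// => [ '4', '2', '3' ]

// All three ranges in one set: any letter or digit.
console.log(str.match(/[a-zA-Z0-9]/g).join(""));
// => "Room42Bfloor3"
```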

Often when you are working in JavaScript, you use the ! to specify the opposite of whatever comes after it. For example:

let alive = true;

if(!alive){
   // do something when alive === false;
}

Our if statement is checking if not alive because of the !.

Regular expressions have this same functionality within [] using the ^ symbol (shift + 6). When the ^ symbol is put in [], you are telling the expression to return all characters that do not match what is in the [].

As you now see, certain characters like ^ are used in different ways within a regex pattern. The use we are covering now is specific to when ^ exists inside of [].

const str = "I like 1 and 2, but not 3!";

const result = str.match(/[^a-z]/gi);
console.log(result);
// => [ ' ', ' ', '1', ' ', ' ', '2', ',', ' ', ' ', ' ', '3', '!' ]

You can see here that our match looked for all characters that did not match an uppercase or lowercase letter of the alphabet, which includes every space in the string. Let's modify this to have it also ignore the , ! and white space.

const str = "I like 1 and 2, but not 3!";

const result = str.match(/[^a-z, !]/gi);
console.log(result);
// => [ '1', '2', '3' ]

Note that the , white space and ! do not have to be in any specific order. Since they are in the [] and come after the ^, they are included in the search parameters for what to ignore.

Here is an example using the ^ with ranges to find all special characters in a string.

const str = "! abc XYZ &#* cl*902-=_'"

const result = str.match(/[^a-z0-9 ]/gi)
console.log(result);
// => [ '!', '&', '#', '*', '*', '-', '=', '_', "'" ]

(Paren | thesis)

While there can be other uses, a big reason to use a set of () in a regular expression is to give your pattern an or statement. Just like you would use || in JavaScript, you use a single | in a regular expression.

Let's see an example of this to help clarify what its use is. In this example, we want to find all instances of the words got, git, and gut, but ignore all others.

const str = "git gat got gen gut";

const result = str.match(/g(i|o|u)t/g);
console.log(result);
// => [ 'git', 'got', 'gut' ]

This is our pattern: /g(i|o|u)t/g.

  • /g : We are saying look for something starting with the letter g.
  • (i|o|u): This says to check if either an i, o, or u comes directly after the g it found.
  • t/g: This is saying that if you found a combination of gi, go or gu, check to see if it ends with a t, and to do this search globally for all matches.
  • This is the same as searching for /git/, /got/, /gut/ in a single expression.

You can use a single (x|y), or chain as many or statements as you need, such as (w|x|y|z).
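
The options inside the parentheses do not have to be single letters. As a sketch with a made-up sentence, here the alternatives are whole words:

```javascript
const str = "one cat, two dogs, three birds and a fish";

// Find 'cat', 'dog' or 'bird' anywhere in the string.
const result = str.match(/(cat|dog|bird)/g);
console.log(result);
// => [ 'cat', 'dog', 'bird' ]

// Text outside the parentheses applies to every option:
// here each option must be followed by an 's'.
const plurals = str.match(/(cat|dog|bird)s/g);
console.log(plurals);
// => [ 'dogs', 'birds' ]
```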


Metacharacters

As you may have seen on stackoverflow or other sites, many regular expressions will have a \ directly after the beginning / like this:

const pattern = /\d/g;

Don't let the word 'metacharacter' scare you. It's just the term used to specify a character with special meaning. In this example, the d is the metacharacter being used, and the \ is saying that the letter immediately following it is a metacharacter.

The metacharacters we will cover are ., \w, \W, \d, \D, \s and \S. To see a comprehensive list of metacharacters available, check out W3's article here.

As with any other pattern, a metacharacter needs the g modifier if you want to search the entire string rather than stopping at the first match. However, metacharacters are often used to find a single match, and in that case you would leave off the g.

You might be saying "hey, you just said that you know it's a metacharacter by the \ before it." While this is true for almost all metacharacters (mc's), the . is an exception.

. : This mc is a placeholder for any single character (other than a line break) to fill in.

const str = "hit hot had hip h$t";

const result = str.match(/h.t/g);
console.log(result);
// => [ 'hit', 'hot', 'h$t' ]

This pattern told our .match method to find all words that start with h, end with t, and have any single character in the middle. Note that multiple . characters can be used in a row to specify how many placeholder characters you want.
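
As a quick sketch of using more than one placeholder, the second pattern below asks for exactly two characters between the h and the t. The sample string is made up for this example.

```javascript
const str = "hit hoot heat h$$t";

// One placeholder: exactly one character between h and t.
console.log(str.match(/h.t/g));
// => [ 'hit' ]

// Two placeholders: exactly two characters between h and t.
console.log(str.match(/h..t/g));
// => [ 'hoot', 'heat', 'h$$t' ]
```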

\w: This says to find all 'word' characters, which are a-z, A-Z, 0-9 and _ (underscore).

\W: This says to find all 'non-word' characters, which are the characters not included in the \w definition such as a space.

const str = "Hello! 900_!$*%";

// Remove all 'word' characters by replacing them with an empty string.
const result = str.replace(/\w/g, '');
console.log(result);
// => "! !$*%"

// Remove all 'non-word' characters by replacing them with an empty string.
const result2 = str.replace(/\W/g, '');
console.log(result2);
// => "Hello900_"

\d : This says to find a digit character from 0-9.

\D : This says to find a non-digit character.

const str = "I have a 3, 5, and a Jack!"

// Find all digit characters in the string.
const result = str.match(/\d/g);
console.log(result);
// => [ '3', '5' ]


const str2 = "!1208hi0204!"

// Find all non-digit characters in the string.
const result2 = str2.match(/\D/g);
console.log(result2);
// => [ '!', 'h', 'i', '!' ]

\s: This says to find white space characters.

\S: This says to find everything but white space characters.

const str = "I have spaces"

const result = str.match(/\s/g);
console.log(result);
// => [ " ", " " ]

const str2 = "I have spaces"

const result2 = str2.match(/\S/g);
console.log(result2);
// => [ 'I', 'h', 'a', 'v', 'e', 's', 'p', 'a', 'c', 'e', 's' ]

Just like anything else when it comes to programming, regular expressions are going to be hard until you use them a lot and get used to how their syntax looks and works. I would encourage you to solve problems with both string and array methods, but then to also try to solve them with a regular expression.
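
As one small example of that kind of practice, here is a vowel counter written twice: once with string and array methods, and once with a regular expression. Both function names are hypothetical, and either approach is a reasonable way to solve the problem.

```javascript
// Version 1: string and array methods only.
function countVowelsWithArrays(str) {
    const vowels = ["a", "e", "i", "o", "u"];
    return str
        .toLowerCase()
        .split("")
        .filter((char) => vowels.includes(char))
        .length;
}

// Version 2: a character set with the g and i modifiers.
function countVowelsWithRegex(str) {
    // .match() returns null when nothing matches, so fall back to [].
    return (str.match(/[aeiou]/gi) || []).length;
}

console.log(countVowelsWithArrays("Wubba Lubba Dub Dub!")); // => 6
console.log(countVowelsWithRegex("Wubba Lubba Dub Dub!"));  // => 6
```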