WTF is a Regex?

By definition, regex, which is short for regular expression, is a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text.

Author: Renato

Category: Development

Where or when would I use a regex?

Regular expressions are usually used when you want to check if a string matches a certain pattern. For example, you could use a regex to check if a string is a valid email address, check if a username contains disallowed characters, extract all the words starting with a specific letter, etc.

How exactly does regex work?

As the definition says, a regex is a pattern that you test a string against. Let’s say you want to check if a string is a valid email address. You probably know that, in a broad sense, an email address consists of a word that may contain some special characters, followed by an @ (at sign), another word, a . (dot) and one more word. This “description” is exactly what you would write as a regex.
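As a rough sketch of that description, the pattern could look like the one below. It is deliberately simplistic and not a complete email validator; the pattern itself and the looksLikeEmail name are assumptions made for illustration, and the individual symbols are explained in the sections that follow.

```javascript
// Simplified pattern: word characters plus a few specials, "@", a word, ".", a word.
// Real addresses are more varied (subdomains, for example), so treat this as a sketch.
const emailPattern = /^[\w.+-]+@\w+\.\w+$/;

function looksLikeEmail(text) {
  return emailPattern.test(text);
}

console.log(looksLikeEmail('john.doe@example.com')); // true
console.log(looksLikeEmail('john.doe@example'));     // false (no dot after the @ part)
console.log(looksLikeEmail('example.com'));          // false (no @)
```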

How do you define a regex in JavaScript?

In JavaScript, a regex literal is written between two / (slash) symbols, one at the beginning and one at the end of the pattern. Another way is to use the RegExp constructor and pass the pattern to it as a string. Throughout this article, we will use the JavaScript / notation.

Does each character represent itself?

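Each non-special character represents itself. That means that the letters a and B represent themselves, the digit 9 represents itself, etc. Matching is case-sensitive by default. Note that the tables in this article compare each pattern against the whole text in a cell.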

🔮 Regex	✅ Match	⛔️ No Match
/dog/	dog	Dog, dOg, cat, …
/high5/	high5	High5, high 5, highfive, …
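To try the rows above yourself, here is a minimal sketch using the regex's .test() method, which returns true if the pattern is found anywhere in the string:

```javascript
// Literal characters must appear exactly as written, including their case.
console.log(/dog/.test('dog'));      // true
console.log(/dog/.test('Dog'));      // false (uppercase D)
console.log(/high5/.test('high 5')); // false (the space is not in the pattern)

// Without anchors, a pattern also finds its match inside a longer string.
console.log(/dog/.test('hotdog'));   // true
```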

Special characters and flags extend regex abilities and they are what make regexes such a powerful tool.

Square brackets ([])

Square brackets represent a character set. They match a single character that is either one of the characters listed inside them or falls within one of the listed ranges. A range can be a range of letters (lowercase or uppercase) or a range of digits, and is denoted with a - (dash).

Note: Inside [], most special characters lose their special meaning, i.e. they are treated as normal characters.

🔮 Regex	✅ Match	⛔️ No Match
/[a1]/	a, 1	a1, b, c, …
/[ab][12]/	a1, a2, b1, b2	1a, 2b, ab, 12, …
/[a-zA-F][0-9]/	a0, F1, z9, …	z, 0, 9A, Z0, …

A pattern in square brackets can be prefixed with the ^ (caret, also called circumflex) symbol. It works like a NOT operator and means: match any character except the ones in the set.

🔮 Regex	✅ Match	⛔️ No Match
/[^a]/	#, B, 0, q, …	a
/[^0-9][0-9]/	C4, !1, a5, …	01, 1a, 25, …
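Here is a small sketch of sets, ranges and negated sets in action. The sample string and variable names are made up for illustration:

```javascript
const labels = 'a1 b2 c3 ab 12';

// [ab][12]: one of "a"/"b" followed by one of "1"/"2".
const setMatches = labels.match(/[ab][12]/g);
console.log(setMatches); // [ 'a1', 'b2' ]

// [a-zA-F][0-9]: a lowercase letter or A-F, followed by a digit.
console.log(/[a-zA-F][0-9]/.test('F1')); // true
console.log(/[a-zA-F][0-9]/.test('Z0')); // false (Z is outside A-F)

// [^0-9][0-9]: any non-digit followed by a digit.
console.log(/[^0-9][0-9]/.test('C4')); // true
console.log(/[^0-9][0-9]/.test('25')); // false (no non-digit before a digit)
```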

Dot (.)

The dot matches any single character except a newline (the s flag, described in the Flags section, makes it match newlines too).

🔮 Regex	✅ Match	⛔️ No Match
/./	a, %, ., \, …	N/A
/../	xx, Y#, 12, …	N/A
/[A-Z].[0-9]/	A%0, Zz9, CC0, …	000, zz9, ###, …
/[.].[^.]/	.ab, .#c, .a2, …	abc, ..., ab., …
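A quick sketch of the dot, including the newline exception and the fact that a dot inside square brackets is just a literal dot:

```javascript
console.log(/[A-Z].[0-9]/.test('A%0')); // true
console.log(/[A-Z].[0-9]/.test('zz9')); // false (no uppercase letter)

// By default the dot does not match a newline; the s flag changes that.
console.log(/a.c/.test('a\nc'));  // false
console.log(/a.c/s.test('a\nc')); // true

// Inside square brackets the dot is a literal dot.
console.log(/[.]/.test('abc'));   // false
console.log(/[.]/.test('a.c'));   // true
```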

Pipe (|)

Pipe is used as an OR (alternation) operator i.e. match either left or right side.

🔮 Regex	✅ Match	⛔️ No Match
/a|b/	a, b	c, 1, …
/john|mike|dex/	john, mike, dex	michael, jesse, 123, j, …
/[0-9]a|b/	b, 1a, 9a, …	a, 0b, …
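One thing worth trying out is how far each side of the pipe reaches: it splits the whole pattern, not just the neighbouring characters. A small sketch, with variable names chosen for illustration:

```javascript
const names = /john|mike|dex/;
console.log(names.test('mike'));  // true
console.log(names.test('jesse')); // false

// /[0-9]a|b/ means "a digit followed by a" OR "b".
const digitAOrB = /[0-9]a|b/;
console.log('1a'.match(digitAOrB)[0]); // '1a'
console.log('0b'.match(digitAOrB)[0]); // 'b' (only the "b" alternative matched)
console.log(digitAOrB.test('a'));      // false
```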

Plus (+), asterisk (*), curly brackets ({}), question mark (?)

Plus, asterisk, curly brackets and the question mark are part of the special character set called quantifiers. They indicate that a certain character (or a range of characters) must be matched a certain number of times. They are greedy, and will always try to match as much as possible.

+ matches 1 or more repetitions of the preceding character, set or group.

* matches 0 or more repetitions.

{} matches the specified number of repetitions. For example, {2,5} will match between 2 and 5, {6} will match exactly 6, and {4,} will match 4 or more.

? matches 0 or 1 repetition, i.e. it makes the preceding item optional. When placed directly after another quantifier (as in +?, *? or {2,5}?), it has a different meaning: it makes that quantifier lazy, so the quantifier will match as few characters as possible.

🔮 Regex	✅ Match	⛔️ No Match
/do+dle/	dodle, doodle, dooodle, …	ddle, …
/[0-9]{3}-[0-9]{4}/	123-4567, 666-6666, …	1-2, 1111-234, …
/https?/	http, https	anything else
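The difference between greedy and lazy matching is easiest to see side by side. A minimal sketch, where the sample string is made up for illustration:

```javascript
console.log(/do+dle/.test('doodle')); // true
console.log(/do+dle/.test('ddle'));   // false (+ needs at least one "o")
console.log(/do*dle/.test('ddle'));   // true  (* allows zero "o")
console.log(/[0-9]{3}-[0-9]{4}/.test('123-4567')); // true
console.log(/https?/.test('http'));   // true  (the "s" is optional)

const snippet = '<b>bold</b>';
// Greedy: .+ grabs as much as it can, up to the last ">".
console.log(snippet.match(/<.+>/)[0]);  // '<b>bold</b>'
// Lazy: .+? stops at the first ">" it can.
console.log(snippet.match(/<.+?>/)[0]); // '<b>'
```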

Parentheses (())

Parentheses are used for capturing groups. They group multiple parts of an expression together and create a capture group for extracting a substring. Each group can be later accessed as a separate entity.

To group multiple parts of an expression without creating a capture group, prefix the group’s contents with ?: (non-capturing group), as in (?:...).

🔮 Regex	✅ Match	🧩 Group 1	🧩 Group 2
/(ab)c/	abc	ab	N/A
/([0-9])([a-z])/	1c, 2b, …	1, 2, …	c, b, …
/(?:[0-9])([a-z])/	1c, 2b, …	c, b, …	N/A
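In JavaScript, the groups from the table can be read from the array that String.prototype.match returns (when the regex has no g flag). A short sketch:

```javascript
const capturing = '1c'.match(/([0-9])([a-z])/);
console.log(capturing[0]); // '1c' (the full match)
console.log(capturing[1]); // '1'  (group 1)
console.log(capturing[2]); // 'c'  (group 2)

// The non-capturing group still has to match, but it is not stored.
const nonCapturing = '1c'.match(/(?:[0-9])([a-z])/);
console.log(nonCapturing[1]);     // 'c'
console.log(nonCapturing.length); // 2 (full match + one group)
```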

Word (\w), digit (\d) and whitespace (\s)

Regular expressions become really complex really fast, even when the logic is relatively simple. That’s why there are special characters that act as shorthand for common character sets.

\w will match any word character. That includes letters, digits and underscores. An equivalent expression is /[A-Za-z0-9_]/. Its negative counterpart is \W, which will match any non-word character.

\d will match any digit. An equivalent expression is /[0-9]/ and its negative counterpart is \D.

\s will match any whitespace character, including spaces, tabs and line breaks. Its negative counterpart is \S.

🔮 Regex	✅ Match	⛔️ No Match
/\w/	a, 0, _, …	#, -, …
/[A-Z]\w+/	Regex, Wo_rd, Alph4	test, 4lpha, 123, …
/\d{3,}/	123, 1443, 11221, …	2, 31, …
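A few of these shorthands in use. The sample strings below are invented for illustration:

```javascript
const orderText = 'Order 66 shipped in 2023';
// \d+ with the g flag collects every run of digits.
console.log(orderText.match(/\d+/g)); // [ '66', '2023' ]

// \s+ is handy for splitting on any mix of spaces and tabs.
console.log('a  b\tc'.split(/\s+/));  // [ 'a', 'b', 'c' ]

console.log(/[A-Z]\w+/.test('Wo_rd')); // true
console.log(/[A-Z]\w+/.test('test'));  // false (no uppercase letter)
console.log(/\W/.test('#'));           // true  (# is not a word character)
```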

Word boundary (\b)

A word boundary matches a position between a word character (\w) and a non-word character, or between a word character and the start or end of the string. It matches a position, not a character. For example, the expression /y\b/ applied to the string yay yoy will match only the y characters marked with parentheses: ya(y) yo(y). Its negative counterpart is \B; for the same string, /y\B/ would match (y)ay (y)oy.
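To make the matched positions visible, this sketch replaces every matched y with an uppercase Y:

```javascript
// y followed by a word boundary: the last letter of each word.
console.log('yay yoy'.replace(/y\b/g, 'Y')); // 'yaY yoY'

// y NOT followed by a word boundary: the first letter of each word.
console.log('yay yoy'.replace(/y\B/g, 'Y')); // 'Yay Yoy'

// \b on both sides matches whole words only.
console.log(/\bcat\b/.test('a cat sat'));   // true
console.log(/\bcat\b/.test('concatenate')); // false
```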

Beginning and end anchors (^ and $)

Beginning and end anchors, as their names suggest, match the beginning and the end of a string.

🔮 Regex	📄 String	✅ Match
/^\w+/	hello world	hello
/\d+$/	123 hello high5	5
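The two rows above can be reproduced like this. The last two lines show a common use of both anchors together, which forces the whole string to fit the pattern:

```javascript
console.log('hello world'.match(/^\w+/)[0]);     // 'hello'
console.log('123 hello high5'.match(/\d+$/)[0]); // '5'

// With both anchors, nothing else is allowed before or after the digits.
console.log(/^\d+$/.test('123'));  // true
console.log(/^\d+$/.test('123a')); // false
```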

Backslash (\)

Backslash is used to escape any special character in order to match the literal character it stands for.

🔮 Regex	✅ Match	⛔️ No Match
/\[a-z\]/	[a-z]	c, e, …
/wow\$$/	wow$	wow, wow$w, …
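A short sketch of what escaping changes, comparing an escaped and an unescaped plus sign:

```javascript
console.log(/\[a-z\]/.test('[a-z]')); // true  (literal brackets)
console.log(/\[a-z\]/.test('c'));     // false (no longer a character set)

// Escaped + is a literal plus sign; unescaped + is a quantifier.
console.log(/1\+1/.test('1+1')); // true
console.log(/1+1/.test('1+1'));  // false (needs at least two 1s in a row)

// \$ is a literal dollar sign, the final $ is the end anchor.
console.log(/wow\$$/.test('wow$'));  // true
console.log(/wow\$$/.test('wow$w')); // false
```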

Flags (i, g, m, s)

Flags are special attributes that are applied to a whole expression. In JavaScript, they are added at the end, after the second forward slash (/).

i (ignore case) makes the whole expression case-insensitive. That means that, for example, /[a-z]/i will also match uppercase letters.

g (global) returns all matches i.e. doesn’t stop after the first match. It’s typically used when one needs to find and/or replace all occurrences.

m (multiline) makes beginning and end anchors (^ and $) match the start and end of line, instead of the start and end of the whole string.

s (dotall) makes dot (.) match newline character as well.
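Here is a small sketch comparing the same pattern with and without a flag. The sample strings are made up for illustration:

```javascript
// i: ignore case
console.log(/[a-z]/.test('Q'));  // false
console.log(/[a-z]/i.test('Q')); // true

// g: replace every occurrence instead of only the first one
console.log('a-b-c'.replace(/-/, '+'));  // 'a+b-c'
console.log('a-b-c'.replace(/-/g, '+')); // 'a+b+c'

// m: ^ matches at the start of every line, not only the start of the string
const twoLines = 'one\ntwo';
console.log(twoLines.match(/^\w+/g));  // [ 'one' ]
console.log(twoLines.match(/^\w+/gm)); // [ 'one', 'two' ]
```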

JavaScript and Regex

There are two ways of creating a regex in JavaScript.

The first one is using a literal syntax:

const regex = /\d+\w+/i;

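The second one is using the RegExp constructor. Because the pattern is passed as a string, every backslash has to be doubled:

const regex = new RegExp('\\d+\\w+', 'i');

Note: Since ES6, the RegExp constructor also accepts a regex literal together with a flags argument, e.g. new RegExp(/\d+\w+/, 'i').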
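The following sketch compares the two notations and shows the typical reason to reach for the constructor: a pattern that is only known at runtime. The word variable stands in for such a value and is purely hypothetical.

```javascript
const literal = /\d+\w+/i;
// In a string, each backslash must be written twice to survive string parsing.
const constructed = new RegExp('\\d+\\w+', 'i');
console.log(literal.source === constructed.source); // true

// With a single backslash, the string '\d+' is just 'd+'.
console.log(new RegExp('\d+').source); // 'd+'

// Building a pattern from a value known only at runtime (hypothetical input).
const word = 'cat';
const wholeWord = new RegExp('\\b' + word + '\\b', 'i');
console.log(wholeWord.test('The Cat sat')); // true
console.log(wholeWord.test('concatenate')); // false
```

If the runtime value can contain special characters, they would need to be escaped first, which this sketch does not do.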

Example: Check if valid username is provided

We need to check if the user provided a valid username. A username is valid if it’s between 4 and 15 characters long and contains only letters, digits, dashes or underscores.

Let’s construct the regex first. Since we want to check if the entire string is a valid username, we’ll add ^ and $ to match the start and end of the string. Next, we’ll use \w to match letters, digits and underscores (so a separate \d is not needed) and - to match exactly that: a dash. We’ll also add a {4,15} quantifier to limit the username length to between 4 and 15 characters.

function isUsernameValid(username) {
  // \w covers letters, digits and underscores; - adds the dash
  const regex = /^[\w-]{4,15}$/;
  return regex.test(username);
}
console.log(isUsernameValid('john_Doe')); // true
console.log(isUsernameValid('@mike')); // false

Example: Extract class names from an HTML string

Let’s say we have an HTML string like this:

<div class="wrapper grid">
  <div class="column column-left" data-type="large">
    <h1 class="title">Hello World!</h1>
  </div>
  <div class="column column-right">
    <p>Lorem ipsum dolor sit amet.</p>
  </div>
</div>

We want a function that will return an array of unique class names:

['wrapper', 'grid', 'column', 'column-left', 'title', 'column-right']

Let’s first write a regex that will help us get all class values. When we look at the HTML code, we can see that classes are declared using a class attribute. The class keyword is followed by an = (equals) sign, and then the classes are listed between quotes. At first glance, you might be tempted to write something like this:

/class="(.+)"/g

If your HTML contained class attributes only (one per line), this would technically work. However, if you tried to execute this on the HTML above, you would see that the second match would capture:

column column-left" data-type="large

Why? Because the regex said “match anything between two quotes”, and it did, but not between the quotes you wanted. :) When writing regexes, try to be as explicit as possible whenever you can. Let’s fix this by saying “match anything except a double quote symbol”:

/class="([^"]+)"/g

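In order to iterate through matches and capture groups, we need to call the .exec() method on the regex, passing the string to it. If a match is found, .exec() returns an array where the first item is the full match and the next n items are the capture groups; otherwise it returns null. With the g flag, each call continues from where the previous match ended.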

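function getClassNames(html) {
  const regex = /class="([^"]+)"/g;
  const classes = [];
  let match;
  // With the g flag, each exec() call returns the next match, then null
  while ((match = regex.exec(html)) !== null) {
    // match[1] is the first capture group: the raw attribute value
    const className = match[1];
    // Split on whitespace and keep only names not collected yet
    const unique = className.trim()
      .split(/\s+/)
      .filter(c => !classes.includes(c));
    classes.push(...unique);
  }
  return classes;
}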
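As an alternative sketch (not the approach used above), newer JavaScript engines also offer String.prototype.matchAll, which returns an iterator over all matches and avoids the manual exec() loop. The function name getClassNamesWithMatchAll and the short sample string are made up for illustration; a Set takes care of uniqueness.

```javascript
function getClassNamesWithMatchAll(html) {
  const unique = new Set();
  // matchAll requires the g flag and yields one match array per occurrence.
  for (const match of html.matchAll(/class="([^"]+)"/g)) {
    // match[1] holds the attribute value; split it into single class names.
    for (const name of match[1].trim().split(/\s+/)) {
      unique.add(name);
    }
  }
  // A Set keeps insertion order, so names appear in order of first use.
  return [...unique];
}

const sampleHtml = '<div class="wrapper grid"><h1 class="title grid">Hi</h1></div>';
console.log(getClassNamesWithMatchAll(sampleHtml)); // [ 'wrapper', 'grid', 'title' ]
```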

Pro-tip by the author

When writing a regex, always start by writing out several cases that should be matched, and several that shouldn’t. That way, you have a better overview of the logic and will be able to cover more edge cases. Also, as always - practice makes perfect!
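One way to put this tip into practice is to keep the cases next to the pattern and check them in a loop. The sketch below uses a username pattern equivalent to the one from the earlier example; the case lists and the checkCases helper are invented for illustration.

```javascript
const usernamePattern = /^[\w-]{4,15}$/;

// Cases written down before (or while) writing the regex.
const shouldMatch = ['john_Doe', 'mike-99', 'abcd'];
const shouldNotMatch = ['@mike', 'abc', 'name with spaces', 'a'.repeat(16)];

// Returns the inputs whose result differs from what was expected.
function checkCases(pattern, inputs, expected) {
  return inputs.filter(input => pattern.test(input) !== expected);
}

const failures = [
  ...checkCases(usernamePattern, shouldMatch, true),
  ...checkCases(usernamePattern, shouldNotMatch, false),
];
console.log(failures.length === 0 ? 'all cases pass' : failures);
```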