Regex Patterns

Regular expressions are built from a sequence of atoms, each of which is either a literal character or a meta-character with special meaning. Let’s start with the commonly used characters and regexp patterns.

Literal characters

Most characters simply match themselves.

a       # matches 'cat' but not 'dog' or 'CAT'

Non-printable or white-space characters can be matched with their backslash notation.

\t      # matches tab
\n      # matches newline
\r      # matches carriage return
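As a minimal sketch, the behaviour above can be checked with Python's `re` module. Python is an assumption here, since the text is not tied to a particular regex flavor, and the sample strings are made up for illustration.

```python
import re

# A literal character matches itself; matching is case-sensitive by default.
found_in_cat = re.search(r"a", "cat") is not None
found_in_dog = re.search(r"a", "dog") is not None
found_in_upper = re.search(r"a", "CAT") is not None
print(found_in_cat, found_in_dog, found_in_upper)  # True False False

# Backslash notation matches non-printable characters such as tab.
tab_span = re.search(r"\t", "col1\tcol2").span()
print(tab_span)  # (4, 5)
```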

Meta-characters

The meta-characters . * + ? ^ $ { } [ ] \ | ( ) have a special meaning.

.       # matches any character except newline
a|b     # matches 'a' or 'b'
()      # used for precedence or capturing groups

To match one of them literally, escape it with a backslash.

\.      # matches a literal dot
\\      # matches a literal backslash
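The difference between a meta-character and its escaped form can be sketched in Python (an assumed flavor; the sample strings are illustrative). `re.escape` is the standard-library helper that escapes a whole string for literal matching.

```python
import re

sample = "abc a.c"

# Unescaped, the dot matches any character; escaped, only a literal dot.
any_char = re.findall(r"a.c", sample)      # ['abc', 'a.c']
literal_dot = re.findall(r"a\.c", sample)  # ['a.c']

# Alternation, and parentheses used for grouping and capturing.
either = re.findall(r"cat|dog", "cat dog bird")        # ['cat', 'dog']
captured = re.fullmatch(r"gr(a|e)y", "grey").group(1)  # 'e'

# A literal backslash needs to be escaped in the pattern.
backslash_span = re.search(r"\\", "C:\\temp").span()   # (2, 3)

# re.escape escapes the meta-characters in an arbitrary string.
escaped = re.escape("1+1=2?")
matches_literally = re.fullmatch(escaped, "1+1=2?") is not None
print(any_char, literal_dot, either, captured, matches_literally)
```

Using a raw string (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.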

Character classes

A character class specifies the set of characters that should be matched.

[chars]   # any char from given set or range
[^chars]  # any char not in given set or range

Only ] ^ - \ are treated as special in character classes; the rest are literal. A ] placed first is treated as literal (e.g. []x] matches ] or x), and the - is treated as literal if it’s the first or last character.
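A short sketch of these rules in Python (an assumed flavor; other engines may treat a leading ] differently, so check the engine you use):

```python
import re

# Any character from a set, a negated set, and a range.
vowels = re.findall(r"[aeiou]", "regex")       # ['e', 'e']
non_digits = re.findall(r"[^0-9]", "a1b2")     # ['a', 'b']
hex_digits = re.findall(r"[0-9a-f]", "x1fz")   # ['1', 'f']

# A ] placed first in the class is literal.
bracket_or_x = re.findall(r"[]x]", "a]bx")     # [']', 'x']

# A - placed last in the class is literal.
a_or_dash = re.findall(r"[a-]", "a-b")         # ['a', '-']

print(vowels, non_digits, hex_digits, bracket_or_x, a_or_dash)
```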

Shorthand character classes

Abbreviations exist for frequently used character classes.

\d      # any digit = [0-9]
\D      # any non-digit = [^0-9]
\w      # any alphanumeric plus underscore = [a-zA-Z0-9_]
\W      # any char that is not alphanumeric or underscore = [^a-zA-Z0-9_]
\s      # any whitespace char = [ \f\t\n\r\v] (space, form feed, tab, newline, carriage return, vertical tab)
\S      # any non-whitespace char
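The shorthand classes can be tried out as follows. This is a sketch in Python (an assumed flavor) with made-up sample text. Note that for Python `str` patterns the shorthands are Unicode-aware by default; the `re.ASCII` flag restricts them to the ASCII sets listed above.

```python
import re

text = "Order 66: 3 items_total\t(ok)"

digits = re.findall(r"\d+", text)   # ['66', '3']
words = re.findall(r"\w+", text)    # ['Order', '66', '3', 'items_total', 'ok']
fields = re.split(r"\s+", "a  b\tc\nd")  # ['a', 'b', 'c', 'd']

# Unicode-aware by default, ASCII-only with re.ASCII.
accented = "na\u00efve caf\u00e9"   # "naïve café"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)  # ['na', 've', 'caf']

print(digits, words, fields, unicode_words, ascii_words)
```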

Anchors and other zero-width assertions

Zero-width assertions do not match any specific character, but rather a position before, after or between characters.

^       # beginning of a line
$       # end of a line
\A      # start of string
\Z      # end of string (some flavors, e.g. PCRE and Java, also match just before a final newline)
\b      # any boundary between a word char \w and a non-word char \W (eg ',.!?)
\B      # any non-word-boundary
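The anchors can be illustrated with a small Python sketch (an assumed flavor; the sample text is made up). In Python specifically, ^ and $ match at line boundaries only when the `re.MULTILINE` flag is set; without it, ^ matches only at the start of the string.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the very start of the string.
start_default = re.findall(r"^\w+", text)               # ['first']
start_lines = re.findall(r"^\w+", text, re.MULTILINE)   # ['first', 'second']
end_lines = re.findall(r"\w+$", text, re.MULTILINE)     # ['line', 'line']

# \A and \Z ignore MULTILINE and anchor to the whole string.
start_string = re.findall(r"\A\w+", text, re.MULTILINE)  # ['first']
ends_with_line = re.search(r"line\Z", text) is not None  # True

# \b matches at word boundaries, \B everywhere else.
whole_words = re.findall(r"\bcat\b", "cat concat cat!")  # ['cat', 'cat']
inside_word = re.findall(r"\Bcat", "cat concat")         # ['cat'] (from 'concat')

print(start_default, start_lines, end_lines, start_string)
print(ends_with_line, whole_words, inside_word)
```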

Quantifiers

Quantifiers specify how often a given portion of the regexp can (or must) be repeated.

*       # zero or more occurrences of preceding element
?       # zero or one occurrence of preceding element
+       # one or more occurrences of preceding element
{n}     # exactly n occurrences of preceding element
{n,}    # n or more occurrences of preceding element
{n,m}   # between n and m occurrences of preceding element
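Each quantifier can be exercised with a quick Python sketch (an assumed flavor; the patterns and inputs are illustrative):

```python
import re

star = re.findall(r"ab*", "a ab abbb")   # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")   # ['ab', 'abbb']
optional = [w for w in ("color", "colour") if re.fullmatch(r"colou?r", w)]

exactly_four = re.fullmatch(r"\d{4}", "2024") is not None   # True
at_least_two = re.fullmatch(r"\d{2,}", "7") is not None     # False

# {2,3} is greedy: in 'aaaa' it takes three a's and leaves one unmatched.
two_to_three = re.findall(r"a{2,3}", "a aa aaa aaaa")  # ['aa', 'aaa', 'aaa']

print(star, plus, optional, exactly_four, at_least_two, two_to_three)
```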

Greedy vs lazy matching

Greedy quantifiers first match as much as possible, then give back one character at a time only when the rest of the pattern fails to match. This is called backtracking and can have a negative impact on performance.

Lazy quantifiers prefer the shortest possible match and extend it one character at a time only when the rest of the pattern fails to match.

.*      # longest anything
.+      # longest something
.*?     # shortest anything
.+?     # shortest something

For example, a.*?b matches both a b and ab, whereas a.+?b matches a b but not ab, because .+? requires at least one character between a and b.
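The difference can be seen directly in a small Python sketch (an assumed flavor; the input string is made up):

```python
import re

text = "a b ab"

# Greedy: runs to the last b in the input.
greedy = re.search(r"a.*b", text).group()    # 'a b ab'
# Lazy: stops at the first b that lets the match succeed.
lazy = re.search(r"a.*?b", text).group()     # 'a b'

# .*? may match nothing, .+? needs at least one character.
star_matches_ab = re.fullmatch(r"a.*?b", "ab") is not None   # True
plus_matches_ab = re.fullmatch(r"a.+?b", "ab") is not None   # False
plus_matches_a_b = re.fullmatch(r"a.+?b", "a b") is not None # True

print(greedy, lazy, star_matches_ab, plus_matches_ab, plus_matches_a_b)
```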

Tip

It’s best to use .* sparingly and prefer character classes that are more selective (for improved performance and fewer false positives).

Consider the example of matching a string inside double quotes. The naive ".*" does not work: given the input a "b" c "d" e, greedy matching returns as much as possible, namely "b" c "d". The lazy ".*?" would return "b" and "d". Best practice is to use the negated character class "[^"]*".
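The three variants for quoted strings can be compared side by side in Python (an assumed flavor), using the input from the paragraph above:

```python
import re

text = 'a "b" c "d" e'

greedy = re.findall(r'".*"', text)      # ['"b" c "d"']
lazy = re.findall(r'".*?"', text)       # ['"b"', '"d"']
negated = re.findall(r'"[^"]*"', text)  # ['"b"', '"d"']

print(greedy, lazy, negated)
```

The lazy and negated-class patterns give the same result here. The negated class states the intent explicitly (anything but a quote), so the engine never has to try extending the match past a closing quote.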