Regex Patterns#
Regular expressions are constructed from a sequence of atoms, each of which is either a literal character or a meta-character with special meaning. Let’s start with a list of commonly used characters and regexp patterns.
See also
Python regular expression howto
Python re regular expression syntax
Literal characters#
Most characters simply match themselves.
a # matches the 'a' in 'cat', but nothing in 'dog' or 'CAT'
Non-printable or white-space characters can be matched with their backslash notation.
\t # matches tab
\n # matches newline
\r # matches carriage return
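As a minimal sketch of the above, assuming Python's re module (the variable names and sample strings are illustrative only):

```python
import re

# Literal characters match themselves, and matching is case-sensitive by default.
found_in_cat = re.search(r"a", "cat") is not None    # True
found_in_dog = re.search(r"a", "dog") is not None    # False
found_in_upper = re.search(r"a", "CAT") is not None  # False

# Backslash notation matches white-space characters such as tab and newline.
tab_match = re.search(r"\t", "name\tvalue")
newline_parts = re.split(r"\n", "first\nsecond")

print(found_in_cat, found_in_dog, found_in_upper)
print(tab_match.start(), newline_parts)  # 4 ['first', 'second']
```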
Meta-characters#
The meta-characters . * + ? ^ $ { } [ ] \ | ( ) have a special meaning.
. # matches any character except newline
a|b # matches 'a' or 'b'
() # used for precedence or capturing groups
They need to be escaped to match the literal character.
\. # matches a literal dot
\\ # matches a literal backslash
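The difference between an escaped and an unescaped meta-character can be sketched as follows, assuming Python's re module. The sample text is made up for illustration, and re.escape is shown as one possible way to escape arbitrary literal text:

```python
import re

text = "version 1.5 or 1x5"

# An unescaped dot matches any character, so both "1.5" and "1x5" are found.
unescaped = re.findall(r"1.5", text)   # ['1.5', '1x5']

# An escaped dot matches only a literal dot.
escaped = re.findall(r"1\.5", text)    # ['1.5']

# Alternation inside a capturing group: the group records which branch matched.
pet = re.search(r"(cat|dog)s", "hot dogs")

# re.escape() escapes the meta-characters in a literal string for us.
literal = re.escape("a.b")

print(unescaped, escaped, pet.group(1))
print(bool(re.fullmatch(literal, "a.b")), bool(re.fullmatch(literal, "axb")))
```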
Character classes#
A character class specifies the set of characters that should be matched.
[chars] # any char from given set or range
[^chars] # any char not in given set or range
Only ] ^ - \ are treated as special in character classes; the rest are literal. A ] is literal if it’s the first character (eg []x] matches ] or x), a ^ is special only as the first character, and a - is literal if it’s the first or last character.
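A few character classes in action, as a small sketch assuming Python's re module (the inputs are arbitrary examples):

```python
import re

# A set of characters, and a negated set.
vowels = re.findall(r"[aeiou]", "regex")            # ['e', 'e']
non_digits = re.findall(r"[^0-9]", "a1b2")          # ['a', 'b']

# Ranges can be combined inside one class.
hex_runs = re.findall(r"[0-9a-f]+", "ff 10 zz")     # ['ff', '10']

# A ] placed first, and a - placed last, are taken literally.
bracket_or_x = re.findall(r"[]x]", "a]bx")          # [']', 'x']
a_or_hyphen = re.findall(r"[a-]", "a-b")            # ['a', '-']

print(vowels, non_digits, hex_runs, bracket_or_x, a_or_hyphen)
```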
Shorthand character classes#
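Abbreviations exist for frequently used character classes. In Python 3, the equivalents shown here are exact only for bytes patterns or with the re.ASCII flag; for str patterns the shorthands also match the corresponding Unicode characters.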
\d # any digit = [0-9]
\D # any non-digit = [^0-9]
\w # any alphanumeric plus underscore = [a-zA-Z0-9_]
\W # any non-alphanumeric = [^a-zA-Z0-9_]
\s # any whitespace char = [ \f\t\n\r\v] (space, form feed, tab, newline, carriage return, vertical tab)
\S # any non-whitespace char
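The shorthand classes can be tried out as follows. This is a sketch assuming Python 3's re module; the sample strings are invented for illustration:

```python
import re

text = "Order 42: 3 items,\ttotal_7"

digits = re.findall(r"\d+", text)   # ['42', '3', '7']
words = re.findall(r"\w+", text)    # ['Order', '42', '3', 'items', 'total_7']

# \s+ treats any run of white-space (spaces, tabs, newlines) as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")   # ['a', 'b', 'c', 'd']

# For str patterns \w is Unicode-aware unless re.ASCII is given.
unicode_words = re.findall(r"\w+", "na\u00efve")            # one word
ascii_words = re.findall(r"\w+", "na\u00efve", re.ASCII)    # ['na', 've']

print(digits, words, fields, unicode_words, ascii_words)
```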
Anchors and other zero-width assertions#
Zero-width assertions do not match any specific character, but rather a position before, after or between characters.
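^ # beginning of the string, or of each line in multiline mode
$ # end of the string (or just before a trailing newline), or of each line in multiline mode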
\A # start of string
\Z # end of string
\b # any boundary between a word char \w and a non-word char \W (eg ',.!?) or the start/end of the string
\B # any non-word-boundary
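The following sketch, assuming Python's re module, contrasts the line anchors with the string anchors and shows word boundaries. Note that in Python ^ and $ only match at every line when the re.MULTILINE flag is set:

```python
import re

text = "one\ntwo"

# Without flags, ^ matches only at the start of the string.
default_starts = re.findall(r"^\w+", text)                   # ['one']

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)        # ['one', 'two']
line_ends = re.findall(r"\w+$", text, re.MULTILINE)          # ['one', 'two']

# \A and \Z always refer to the whole string, regardless of flags.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)      # ['one']
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)        # ['two']

# \b finds whole words only; \B requires that there is no boundary.
whole_words = re.findall(r"\bcat\b", "cat concat cat!")      # ['cat', 'cat']
inside_word = re.findall(r"\Bcat", "cat concat")             # ['cat'] (from 'concat')

print(default_starts, line_starts, line_ends, string_start, string_end)
print(whole_words, inside_word)
```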
Quantifiers#
Quantifiers specify how often a given portion of the regexp can (or must) be repeated.
* # zero or more occurrences of preceding element
? # zero or one occurrence of preceding element
+ # one or more occurrences of preceding element
{n} # exactly n occurrences of preceding element
{n,} # n or more occurrences of preceding element
{n,m} # between n and m occurrences of preceding element (inclusive)
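To see each quantifier at work, here is a small sketch assuming Python's re module. The helper matching() and the sample list are hypothetical names used only for this illustration:

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")            # ['ct', 'cat', 'caat', 'caaat']
optional = matching(r"ca?t")        # ['ct', 'cat']
plus = matching(r"ca+t")            # ['cat', 'caat', 'caaat']
exactly_two = matching(r"ca{2}t")   # ['caat']
two_or_more = matching(r"ca{2,}t")  # ['caat', 'caaat']
one_to_two = matching(r"ca{1,2}t")  # ['cat', 'caat']

print(star, optional, plus, exactly_two, two_or_more, one_to_two)
```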
Greedy vs lazy matching#
Greedy quantifiers start by matching as much as possible, and give back one character at a time only when the rest of the pattern fails to match. This is called backtracking and can have a negative impact on performance.
Lazy quantifiers prefer the shortest possible match and increase the number of characters only if the rest of the pattern fails to match.
.* # longest anything
.+ # longest something
.*? # shortest anything
.+? # shortest something
For example, a.*?b matches both a b and ab, whereas a.+?b matches a b but not ab.
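The example can be checked with a short sketch, assuming Python's re module. The second half uses an invented input to contrast greedy and lazy matching on the same text:

```python
import re

# .*? may match nothing, so "ab" is accepted; .+? needs at least one character.
lazy_star_ab = re.fullmatch(r"a.*?b", "ab") is not None     # True
lazy_plus_ab = re.fullmatch(r"a.+?b", "ab") is not None     # False
lazy_plus_a_b = re.fullmatch(r"a.+?b", "a b") is not None   # True

# On the same input, the greedy form runs to the last b, the lazy form to the first.
greedy = re.search(r"a.*b", "a b a b").group()    # 'a b a b'
lazy = re.search(r"a.*?b", "a b a b").group()     # 'a b'

print(lazy_star_ab, lazy_plus_ab, lazy_plus_a_b)
print(repr(greedy), repr(lazy))
```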
Tip
It’s best to use .* sparingly and prefer character classes that are more selective (for improved performance and fewer false positives).
Consider the example of matching the string inside double quotes. The naive ".*" does not work. Indeed, given the input a "b" c "d" e, the greedy matching will return as much as possible, namely "b" c "d". The lazy ".*?" would return "b" and "d". Best practice is to make use of the negated character class "[^"]*".
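The three variants can be compared side by side on the input from the text, in a sketch assuming Python's re module:

```python
import re

text = 'a "b" c "d" e'

# Greedy: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)        # ['"b" c "d"']

# Lazy: stops at the earliest closing quote.
lazy = re.findall(r'".*?"', text)         # ['"b"', '"d"']

# Negated character class: cannot run past a quote in the first place.
negated = re.findall(r'"[^"]*"', text)    # ['"b"', '"d"']

print(greedy, lazy, negated)
```

The lazy and negated-class patterns return the same result here; the negated class states the intent directly and avoids relying on backtracking to find the closing quote.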