Regex Patterns#
Regular expressions are built from a sequence of atoms, each of which is either a literal character or a metacharacter with a special meaning. Let’s start with the most commonly used characters and regexp patterns.
See also
Python regular expression howto
Python re regular expression syntax
Literal characters#
Most characters simply match themselves.
a # matches the 'a' in 'cat', but nothing in 'dog' or 'CAT'
Non-printable or whitespace characters can be matched with their backslash notation.
\t # matches tab
\n # matches newline
\r # matches carriage return
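As a minimal sketch of literal matching, the snippet below uses Python's `re.search`. The sample strings (including `"name\tvalue"`) are made up for illustration.

```python
import re

# Literal characters match themselves, and matching is case-sensitive by default.
in_cat = re.search(r"a", "cat") is not None
in_dog = re.search(r"a", "dog") is not None
in_upper = re.search(r"a", "CAT") is not None

# \t in the pattern matches a real tab character in the input.
tab_match = re.search(r"\t", "name\tvalue")
tab_pos = tab_match.start()

print(in_cat, in_dog, in_upper, tab_pos)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.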
Metacharacters#
The metacharacters . * + ? ^ $ { } [ ] \ | ( ) have a special meaning.
. # matches any character except newline
a|b # matches 'a' or 'b'
() # used for precedence or capturing groups
They need to be escaped to match the literal character.
\. # matches any dot character
\\ # matches any backslash character
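The effect of escaping can be sketched as follows. The sample text is hypothetical. `re.escape` is the standard-library helper for escaping arbitrary text before embedding it in a pattern.

```python
import re

# Sample text; the Python literal "C:\\temp" contains a single backslash.
text = "file.txt filextxt C:\\temp"

# An unescaped dot matches any character, so both file names are found.
unescaped = re.findall(r"file.txt", text)

# An escaped dot matches only a literal dot.
escaped = re.findall(r"file\.txt", text)

# In a raw string, \\ is the regex for one literal backslash.
backslashes = re.findall(r"\\", text)

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("a.b")
matches_literal = re.fullmatch(literal_pattern, "a.b") is not None
matches_other = re.fullmatch(literal_pattern, "axb") is not None

print(unescaped, escaped, len(backslashes), matches_literal, matches_other)
```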
Character classes#
A character class specifies the set of characters that should be matched.
[chars] # any char from given set or range
[^chars] # any char not in given set or range
Only ] ^ - \ are treated as special in character classes; the rest are literal. A ] placed first is also literal (e.g. []x] matches ] or x), ^ is special only as the first character, and - is literal if it’s the first or last character.
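A short sketch of sets, ranges, negation, and the literal placement rules, using `re.findall` on made-up inputs:

```python
import re

# A set and a range: vowels, then lowercase hex digits.
vowels = re.findall(r"[aeiou]", "regex")
hex_digits = re.findall(r"[0-9a-f]", "x1fz")

# A negated class: any character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# ] placed first and - placed last are literal inside a class.
brackets = re.findall(r"[]x]", "a]x")
dashes = re.findall(r"[a-]", "a-b")

print(vowels, hex_digits, non_digits, brackets, dashes)
```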
Shorthand character classes#
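Abbreviations exist for frequently used character classes. The equivalences below hold for ASCII-only matching (bytes patterns or the re.ASCII flag); for str patterns, Python 3 also matches the corresponding Unicode characters by default.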
\d # any digit = [0-9]
\D # any non-digit = [^0-9]
\w # any alphanumeric plus underscore = [a-zA-Z0-9_]
\W # any non-alphanumeric = [^a-zA-Z0-9_]
\s # any whitespace char = [ \f\t\n\r\v] (space, form feed, tab, newline, carriage return, vertical tab)
\S # any non-whitespace char
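The following sketch shows the shorthand classes in action and how the `re.ASCII` flag changes what `\w` matches on a `str` pattern. The sample text is hypothetical.

```python
import re

# "caf\u00e9" is "café" written with an escape to avoid encoding surprises.
text = "caf\u00e9 au_lait x2"

digits = re.findall(r"\d+", text)

# By default, \w on a str pattern includes non-ASCII letters such as é.
words_unicode = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
words_ascii = re.findall(r"\w+", text, flags=re.ASCII)

# \s+ matches any run of whitespace, so it works as a flexible separator.
fields = re.split(r"\s+", "a \t b\nc")

print(digits, words_unicode, words_ascii, fields)
```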
Anchors and other zero-width assertions#
Zero-width assertions do not match any specific character, but rather a position before, after or between characters.
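^ # beginning of the string (of each line with re.MULTILINE)
$ # end of the string or just before a trailing newline (end of each line with re.MULTILINE)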
\A # start of string
\Z # end of string
\b # word boundary: between a word char \w and a non-word char \W (e.g. ',.!?) or the start/end of the string
\B # any non-word-boundary
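The sketch below illustrates how the anchors behave in Python, including the effect of the `re.MULTILINE` flag on `^` and `$`. The inputs are made up for illustration.

```python
import re

text = "one\ntwo"

# Without flags, ^ matches only at the very start of the string.
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# $ behaves the same way at the end of the string or of each line.
default_ends = re.findall(r"\w+$", text)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \b keeps 'cat' from matching inside 'concat'.
whole_word = re.findall(r"\bcat\b", "cat concat cat!")
# \B requires a non-boundary, so only the 'cat' inside 'concat' matches.
inside_word = re.findall(r"\Bcat", "cat concat")

print(default_starts, line_starts, default_ends, line_ends)
print(whole_word, inside_word)
```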
Quantifiers#
Quantifiers specify how often a given portion of the regexp can (or must) be repeated.
* # zero or more occurrences of preceding element
? # zero or one occurrence of preceding element
+ # one or more occurrences of preceding element
{n} # exactly n occurrences of preceding element
{n,} # n or more occurrences of preceding element
{n,m} # at least n and at most m occurrences of preceding element
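To compare the quantifiers side by side, this sketch tests each one against the same small list of strings. The `samples` list and the `matching` helper are hypothetical names introduced only for this illustration.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")            # zero or more 'a'
optional = matching(r"ca?t")        # zero or one 'a'
plus = matching(r"ca+t")            # one or more 'a'
exactly_two = matching(r"ca{2}t")   # exactly two 'a'
two_or_more = matching(r"ca{2,}t")  # two or more 'a'
one_to_two = matching(r"ca{1,2}t")  # one or two 'a'

print(star, optional, plus, exactly_two, two_or_more, one_to_two)
```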
Greedy vs lazy matching#
Greedy quantifiers first match as much as possible, then give back one character at a time only when the rest of the pattern fails to match. This is called backtracking and can have a negative impact on performance.
Lazy quantifiers prefer the shortest possible match and consume additional characters only if the rest of the pattern fails to match.
.* # longest anything
.+ # longest something
.*? # shortest anything
.+? # shortest something
For example, a.*?b matches both 'a b' and 'ab', whereas a.+?b matches 'a b' but not 'ab'.
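The difference between greedy and lazy matching can be checked with a small sketch. The input `"a1b2b"` is a made-up string with two possible end points for the match.

```python
import re

text = "a1b2b"

# Greedy: .* runs to the end, then backs off to the last 'b'.
greedy = re.search(r"a.*b", text).group()

# Lazy: .*? stops at the first 'b' that lets the pattern succeed.
lazy = re.search(r"a.*?b", text).group()

# .*? may match nothing at all, whereas .+? needs at least one character.
lazy_star_on_ab = re.fullmatch(r"a.*?b", "ab") is not None
lazy_plus_on_ab = re.fullmatch(r"a.+?b", "ab") is not None
lazy_plus_on_a_b = re.fullmatch(r"a.+?b", "a b") is not None

print(greedy, lazy, lazy_star_on_ab, lazy_plus_on_ab, lazy_plus_on_a_b)
```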
Tip
It’s best to use .* sparingly and prefer more selective character classes (for better performance and fewer false positives).
Consider the example of matching strings inside double quotes. The naive ".*" does not work: given the input a "b" c "d" e, greedy matching returns as much as possible, namely "b" c "d". The lazy ".*?" returns "b" and "d". Best practice is to use the negated character class "[^"]*", which cannot run past a closing quote.
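The three approaches to the quoted-string example can be compared directly with `re.findall`, using the same input as in the text:

```python
import re

text = 'a "b" c "d" e'

# Greedy: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)

# Lazy: stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)

# Negated class: cannot cross a quote, so no backing off is needed here.
negated = re.findall(r'"[^"]*"', text)

print(greedy, lazy, negated)
```

Note that, unlike the dot, the negated class also matches newlines, so it finds quoted strings that span several lines.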