POSIX regular expressions#

It is easiest to think about regular expressions that they specify certain textual patterns. For example the expression: t[io]m_[0-9]

This is pattern standing for all words starting with t, followed either by an i or an o, then afterwards an m, an underscore _ and a single digit number:

  • tim_0, tom_1, tom_0, tim_9, …

Often regular expressions are applied to a search space which is most often a line of text but in principle could be any string. The question is, whether the specified pattern is contained in that string, or if the string matches the regular expression. In this sense atom_23 and Hi I' am tim_3! both match, while for example int time_0; does not.

The POSIX regular expression language is built of blocks actually matching stuff and quantifiers changing the matching behavior of such blocks.

Characters#

Most characters just match themselves and nothing else. But of course there are some exceptions, those are called meta characters.

Expression

Meaning

a

matches a, this hold true for almost all characters

^

matches the beginning of the search space

$

matches the end of the search space

.

matches any character

\.

matches an actual ., see escaping

( ) [ ] { }

meta-characters used for groups, classes and quantifiers, see below

* + ?

meta-characters used for quantifiers, see below

Note

The POSIX basic regular expression flavour uses all parenthesis as normal characters matching themselves. The meta-character parenthesis have to escaped with a backslash, i.e. \{ instead of }. Some tools ere not very strict about this distinction an try to guess if a regex is basic or extended.

Character classes#

Expression

Character set

[:digit:]

a single digit 0 to 9

[:lower:]

characters a to z

[:upper:]

characters A to Z

[:alpha:]

characters from [:lower:] and [:upper:]

[:alnum:]

characters from [:alpha:] and [:digit:]

[:space:]

white spaces

[:punct:]

punctuation characters

There are more classes defined and actually their exact definition depends on the systems locale.

Quantifiers#

Quantifiers allow repetition of previous blocks, either single characters, character classes or groups (see below).

Expression

Equivalent

Meaning

{m,n}

The preceding expression is repeated at least n times and at most m times. n can be left out, (i.e. {m,}) which means \(\text{n}=\infty\)

{n}

{n,n}

The preceding expression matches exactly n times

*

{0,}

The preceding expression matches arbitrarily often, maybe not at all

+

{1,}

The preceding expression matches arbitrarily often, but at least one time

?

{0,1}

The preceding expression matches at most one time

Greedy and reluctant matches#

Of the variable quantifiers above {m,n}, + and * are greedy quantifiers. They try to consume as many characters as possible. For example the regular expression ".*" matches the whole sentence "Run", she said, "I know regular expressions" including the , she said,. To avoid this behavior the regular expression ".*?" has to used. This results in two matches ("Run" and "I know regular expressions"). This make the quantifiers lazy or reluctant, trying to match as few characters as possible.

|<----------------------------------------->|  greedy match  
"Run", she said, "I know regular expressions"
|<->|            |<------------------------>|  reluctant matches

Note

Most regex engines do not consider overlapping matches. They linearly search through the given data and ignore previous matches. That is why ", she said, " is not considered as a match in the previous example.

Alternatives#

Custom character classes can be defined by using square brackets. Inside square brackets ^, $ and . loose there special meaning and match themselves. However a leading ^ is used for negation. A pipe | is used for alternatives between sub expressions.

Expression

Meaning

[\(class\)]

characters from predefined \(class\)

[aeiou]

characters a, e, i, o and u

[a-f]

characters a to f

[-a-f]

characters a to f and -

[^<>]

any character except < or >

\(\text{expr}_A\) | \(\text{expr}_B\)

match \(\text{expr}_A\) or \(\text{expr}_B\)

Groups#

Subexpressions can be grouped and used in back references or to apply quantifiers. A group is marked by parenthesis about the subexpression. (Abc){3} is a regular expression matching AbcAbcAbc. This is different from [Abc]{3} matches any string of length 3 just containing the characters A, b and c, e.g. Acc.

Moreover the groups in a regular expression are captured. This means, they are stored in the order of their appearance and can be referenced in another location of the expression using their number. For example:

(foo|bar) == \1

matches foo == foo and bar == bar but not foo == bar or bar == foo.

Note

Implementing back references is makes the implementation and of matching algorithms and much more complex and also increases the runtime. Therefore a second grouping syntax exists which avoids capturing. This may should be used if back references are not required (?:\(\text{expr}\)). So in in the example above (?:Abc){3} preferable over (Abc){3}.