Regex in Python

String matching without regex

For simple substring matching you don’t need any regular expressions or additional modules.

data = "1.0 m"

if "m" in data:
    print("String contains an 'm'")

The built-in string method find goes one step further: it returns the index of the first position where the substring was found, or -1 if there is no match.

>>> "John Doe".find("o")  # returns index of first match
1
>>> "John Doe".find("o", 2)  # idem, but starting at index 2
6
>>> "John Doe".find("a")  # returns -1 if no match was found
-1
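
A few other built-in string methods cover common cases without regex. The following is a minimal sketch using only standard str methods; the sample strings are reused from the examples above purely for illustration.

```python
data = "1.0 m"  # sample value, chosen for illustration

# startswith / endswith test for a fixed prefix or suffix
print(data.endswith("m"))
print(data.startswith("2"))

# count returns the number of non-overlapping occurrences
n_o = "John Doe".count("o")
print(n_o)

# index behaves like find, but raises ValueError instead of returning -1
try:
    "John Doe".index("a")
    found = True
except ValueError:
    found = False
print(found)

# partition splits at the first occurrence of the separator
value, sep, unit = data.partition(" ")
print(value, unit)
```

Whether find or index is preferable depends on how a missing substring should be handled: find needs an explicit check for -1, while index fails loudly.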

String substitutions with regex

Regular expressions can be used to replace the portions of a string that match a given pattern, for example to replace every digit with #.

import re

re.sub(r"\d", "#", "abc123")  # --> abc###
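
re.sub also supports replacements that depend on what was matched. The sketch below shows three variations under illustrative assumptions: the input strings and the helper function double are made up for this example.

```python
import re

# Reuse captured groups in the replacement string with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "John Doe")
print(swapped)  # Doe John


# Pass a function to compute each replacement from the match object
def double(match):
    # hypothetical helper: doubles every integer found in the text
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears

# The count argument limits the number of replacements
masked = re.sub(r"\d", "#", "abc123", count=2)
print(masked)  # abc##3
```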

Format validation with regex

A typical use case for regex is checking that a given string has a valid format.

import re

measurements = ["1.0 m", "1.000 m", "1m", "1 m", "1  m"]

for measurement in measurements:
    assert re.match(r"[\d.]*\s*m", measurement)

Best practice is to compile the pattern once, outside the loop, with re.compile() and to document it using the re.X (verbose) flag, which makes the regex engine ignore whitespace and comments inside the pattern.

import re

measurements = ["1.0 m", "1.000 m", "1m", "1 m", "1  m"]

re_valid_format = re.compile(
    r"""     # raw string to treat \ as literal
    ^        # beginning of string
    \d+      # one or more digits
    (\.\d*)? # optional dot with decimals
    \s*      # any number of whitespace characters
    m        # the unit must be m
    $        # end of string
""",
    re.X,
)

for measurement in measurements:
    assert re_valid_format.match(measurement)
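
A validation pattern is only useful if it also rejects bad input. The sketch below repeats a pattern equivalent to the compiled one above, so that it runs on its own, and checks it against a list of invalid strings. That list is hypothetical and chosen for illustration.

```python
import re

# Equivalent to the compiled pattern above, repeated to keep this runnable
re_valid_format = re.compile(
    r"""
    ^        # beginning of string
    \d+      # one or more digits
    (\.\d*)? # optional dot with decimals
    \s*      # any number of whitespace characters
    m        # the unit must be m
    $        # end of string
    """,
    re.X,
)

# Hypothetical invalid inputs, chosen for illustration
invalid = ["m", "1.0 mm", "1,0 m", ".5 m", "one m"]

for text in invalid:
    assert re_valid_format.match(text) is None

# The unanchored pattern from the first example accepts some of them,
# because re.match only anchors the start of the string
loose_mm = bool(re.match(r"[\d.]*\s*m", "1.0 mm"))
loose_m = bool(re.match(r"[\d.]*\s*m", "m"))
print(loose_mm, loose_m)  # True True
```

Note that $ also matches just before a trailing newline, so "1 m\n" would still pass. If that matters, \Z or re_valid_format.fullmatch() can be used instead.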

Data parsing with regex

Another important application of regular expressions is parsing a string to extract data from it. Named capturing groups, written (?P<name>regex), make this elegant: each captured value can be looked up by its name.

import re

measurements = ["1.0 m", "1.000  m", "100cm", "1000 mm"]

re_parse_value_unit = re.compile(
    r"""
    (?P<value>[\d.]*)  # numerical value
    \s*                # whitespace
    (?P<unit>[\w]+)    # unit
""",
    re.X,
)

for m in measurements:
    data = re_parse_value_unit.match(m).groupdict()
    print(data["value"], data["unit"])
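
Once value and unit are captured, they can be converted into a common unit. The following is a minimal sketch, not a definitive implementation: the conversion table TO_METERS and the function parse_to_meters are hypothetical names introduced here, and the pattern is a slightly stricter variant of the one above (at least one digit, lowercase letters for the unit). Unlike the loop above, it returns None rather than raising when the input does not match.

```python
import re

re_parse_value_unit = re.compile(
    r"""
    (?P<value>\d+(\.\d*)?)  # numerical value, at least one digit
    \s*                     # whitespace
    (?P<unit>[a-z]+)        # unit
    """,
    re.X,
)

# Hypothetical conversion table: factor from each unit to meters
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def parse_to_meters(text):
    """Return the length in meters, or None if text cannot be parsed."""
    match = re_parse_value_unit.fullmatch(text)
    if match is None or match["unit"] not in TO_METERS:
        return None
    return float(match["value"]) * TO_METERS[match["unit"]]


for text in ["1.0 m", "1.000  m", "100cm", "1000 mm", "abc"]:
    print(text, "->", parse_to_meters(text))
```

Using fullmatch instead of match means trailing garbage such as "1 m!" is rejected rather than silently ignored.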