Regex in Python#
String matching without regex#
For simple substring matching you don’t need any regular expressions or additional modules.
data = "1.0 m"
if "m" in data:
    print("String contains an 'm'")
The built-in string method find returns the index of the first occurrence of the substring, optionally starting the search at a given index, or -1 if the substring is not found.
>>> "John Doe".find("o") # returns index of first match
1
>>> "John Doe".find("o", 2) # idem, but starting at index 2
6
>>> "John Doe".find("a") # returns -1 if no match was found
-1
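The optional start argument of find can also be used to collect every occurrence of a substring. The following is a minimal sketch; the helper name find_all is hypothetical and not part of the standard library.

```python
def find_all(text, sub):
    """Return the indices of all (possibly overlapping) occurrences of sub."""
    # find_all is a hypothetical helper for illustration, not a built-in
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # resume the search one character after the previous match
        start = text.find(sub, start + 1)
    return positions


print(find_all("John Doe", "o"))  # [1, 6]
```

Once the pattern is more than a fixed substring, re.finditer is usually the better tool.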
String substitutions with regexp#
Regexp can be used to replace the portions of a string that match a given pattern, for example to replace every digit with #:
import re
re.sub(r"\d", "#", "abc123") # --> abc###
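re.sub also accepts a count argument and replacement strings that refer to captured groups. The short sketch below illustrates both; the input strings and variable names are chosen for illustration only.

```python
import re

# count limits the number of replacements (here: only the first digit)
first_only = re.sub(r"\d", "#", "abc123", count=1)
print(first_only)  # abc#23

# \d+ matches a whole run of digits, which is replaced by a single '#'
collapsed = re.sub(r"\d+", "#", "abc123")
print(collapsed)  # abc#

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "John Doe")
print(swapped)  # Doe John
```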
Format validation with regex#
A typical use case for regex is checking that a given string has a valid format.
import re
measurements = ["1.0 m", "1.000 m", "1m", "1 m", "1 m"]
for measurement in measurements:
    assert re.match(r"[\d.]*\s*m", measurement)
It is good practice to compile the regexp once, outside the loop, with re.compile() and to document it using the re.X (verbose) flag, which makes the pattern ignore whitespace and allows comments.
import re
measurements = ["1.0 m", "1.000 m", "1m", "1 m", "1 m"]
re_valid_format = re.compile(
    r"""         # raw string, so backslashes reach the regex engine unchanged
    ^            # beginning of the string
    \d+          # one or more digits
    (\.\d*)?     # optional dot with decimals
    \s*          # any amount of whitespace
    m            # the unit must be m
    $            # end of the string
    """,
    re.X,
)
for measurement in measurements:
    assert re_valid_format.match(measurement)
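To see why the anchors matter, the loose pattern from the first validation example can be compared with a stricter one. This is a sketch under assumptions: the candidate strings are invented test inputs, and re.fullmatch is used here as an alternative to writing ^ and $ explicitly.

```python
import re

# Hypothetical test inputs, not taken from the measurements list
candidates = ["1.0 m", "1m", "1 mile", "m", "1.2.3 m"]

# Loose pattern: re.match only checks the start of the string
re_loose = re.compile(r"[\d.]*\s*m")
# Stricter pattern: fullmatch() requires the entire string to match
re_strict = re.compile(r"\d+(\.\d*)?\s*m")

loose = {c: bool(re_loose.match(c)) for c in candidates}
strict = {c: bool(re_strict.fullmatch(c)) for c in candidates}

for c in candidates:
    print(f"{c!r:10} loose={loose[c]} strict={strict[c]}")
```

The loose pattern accepts all five inputs, while the strict one accepts only "1.0 m" and "1m". Note that $ also matches just before a trailing newline, whereas fullmatch does not.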
Data parsing with regexp#
Another important application of regexp is parsing a string to extract data from it. Named capturing groups, written (?P<name>regex), make this straightforward.
import re
measurements = ["1.0 m", "1.000 m", "100cm", "1000 mm"]
re_parse_value_unit = re.compile(
    r"""
    (?P<value>[\d.]*)  # numerical value
    \s*                # whitespace
    (?P<unit>[\w]+)    # unit
    """,
    re.X,
)
for m in measurements:
    data = re_parse_value_unit.match(m).groupdict()
    print(data["value"], data["unit"])
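The parsed groups are strings, so a natural next step is converting them to numbers in a common unit. The sketch below is one possible approach, not a definitive implementation: the conversion table UNIT_TO_METER, the helper to_meters and the slightly stricter pattern are assumptions made for illustration.

```python
import re

# Assumed conversion factors to meters; extend as needed
UNIT_TO_METER = {"m": 1.0, "cm": 0.01, "mm": 0.001}

# Stricter variant of the pattern: at least one digit or dot, letters-only unit
re_parse_value_unit = re.compile(r"(?P<value>[\d.]+)\s*(?P<unit>[a-z]+)")


def to_meters(measurement):
    """Parse a string such as '100cm' and return its value in meters."""
    match = re_parse_value_unit.fullmatch(measurement)
    if match is None:
        # match objects are None when the pattern does not apply
        raise ValueError(f"cannot parse measurement: {measurement!r}")
    unit = match.group("unit")
    if unit not in UNIT_TO_METER:
        raise ValueError(f"unknown unit: {unit!r}")
    # float() raises ValueError for malformed numbers such as '1.2.3'
    return float(match.group("value")) * UNIT_TO_METER[unit]


for text in ["1.0 m", "1.000 m", "100cm", "1000 mm"]:
    print(text, "->", to_meters(text), "m")
```

Checking for None before calling group() avoids an AttributeError on input that does not match the pattern.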