Regular Expressions

Regular expressions, often shortened to regexp, are special character sequences that describe patterns to match in text. They are similar to globbing and wildcard matching (e.g. ls *.txt), but they differ in syntax and offer more advanced features.
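To make the difference from globbing concrete, here is a minimal sketch that matches the same file names once with a glob and once with a regexp. The file names are made up for illustration; fnmatch and re are both part of the Python standard library.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
filenames = ["notes.txt", "data.csv", "readme.txt", "txt"]

# Glob: `*` means "any run of characters" and `.` is a literal dot.
glob_hits = fnmatch.filter(filenames, "*.txt")

# Regexp: `.` means "any single character" and `*` means "repeat the
# previous item zero or more times", so the equivalent pattern is
# `.*\.txt`, with the literal dot escaped.
pattern = re.compile(r".*\.txt")
regex_hits = [name for name in filenames if pattern.fullmatch(name)]

print(glob_hits)   # ['notes.txt', 'readme.txt']
print(regex_hits)  # ['notes.txt', 'readme.txt']
```

Both approaches select the same files here, but the glob `*.txt` used directly as a regexp would be invalid, because `*` would have nothing to repeat.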

Regexp can be extremely useful in a variety of situations: data analysis (extraction, cleaning, parsing), powerful find-and-replace operations in text and code, and command-line tools like sed and grep. Although it’s not necessarily an everyday topic for all scientists, we believe that it’s valuable to know the basic concepts and to be able to spot when a problem at hand can be solved with regular expressions.
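As a small illustration of the extraction and find-and-replace use cases, the following sketch uses Python's re module on an invented log line. The text, the run_NN naming scheme, and the trial-NN replacement are assumptions made up for this example.

```python
import re

# Invented sample text, used only for illustration.
log = "run_01 took 12.5 s, run_02 took 9.75 s, run_10 took 101 s"

# Extraction: capture the run id and the duration as two groups.
# `\d+` is one or more digits; `(?:\.\d+)?` is an optional decimal part.
timings = re.findall(r"(run_\d+) took (\d+(?:\.\d+)?) s", log)

# Find and replace: rename `run_NN` to `trial-NN`, reusing the captured
# digits through the backreference `\1`.
renamed = re.sub(r"run_(\d+)", r"trial-\1", log)

print(timings)
# [('run_01', '12.5'), ('run_02', '9.75'), ('run_10', '101')]
print(renamed)
# trial-01 took 12.5 s, trial-02 took 9.75 s, trial-10 took 101 s
```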

Beware that regexp engines differ slightly in syntax and behavior, so the exact syntax and supported features depend on the programming language or library at hand. The Linux module has already presented POSIX regexps. Another well-known implementation is the Perl Compatible Regular Expressions (PCRE) library, written in C. The Python standard library contains the re module, which is not fully PCRE-compatible. The third-party regex package brings additional features and improved Unicode support.
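One way to see such engine differences is to check whether a pattern compiles at all. The sketch below assumes a current CPython release, where the re module rejects the Unicode property class `\p{L}` ("any letter") that PCRE and the third-party regex package accept. The helper compiles_with_re is a hypothetical name introduced here, and the regex part only runs if that package happens to be installed.

```python
import re

def compiles_with_re(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# `\w+` is understood by practically every engine.
print(compiles_with_re(r"\w+"))

# `\p{L}+` is valid in PCRE, but re reports a bad escape
# (assumption: behavior of current CPython releases).
print(compiles_with_re(r"\p{L}+"))

# Optional: the third-party regex package supports Unicode properties.
try:
    import regex  # not part of the standard library
except ImportError:
    regex = None

if regex is not None:
    print(regex.findall(r"\p{L}+", "naïve café 42"))
```

When a pattern copied from another tool fails like this, the documentation of the engine in use is the place to check which constructs it supports.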

Index