Regular expressions

Regular expressions (regex or regexp) are search patterns described as strings. They allow pattern matching in arbitrary strings.

Regular expressions are implemented in Python's re module.

The re module operates via two objects:

  1. pattern objects, which are compiled regular expressions, and
  2. match objects, which describe successful pattern matches.

In [1]:
import re

r = re.compile("abc")
type(r), r


Out[1]:
(_sre.SRE_Pattern, re.compile(r'abc', re.UNICODE))

In [4]:
m = r.search("aabcd")
type(m), m


Out[4]:
(_sre.SRE_Match, <_sre.SRE_Match object; span=(1, 4), match='abc'>)

search returns None when the pattern is not found anywhere in the string:


In [5]:
m2 = r.search("def")
m2
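
Because search may return None, calling a match method on the result without a check raises AttributeError. The guard below is a minimal sketch; the helper name describe is hypothetical and not part of the re module.

```python
import re

r = re.compile("abc")

def describe(text):
    """Report whether the compiled pattern occurs anywhere in text."""
    m = r.search(text)
    if m is None:              # no match: search returned None
        return "not found"
    return "found"             # m is a match object here

print(describe("aabcd"))       # found
print(describe("def"))         # not found
```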

The match object's span method returns the starting and ending index of the match:


In [6]:
m = r.search("aabcde")
m.span()


Out[6]:
(1, 4)

In [7]:
m.start(), m.end()


Out[7]:
(1, 4)

These can be used directly to extract the matching substring:


In [8]:
s = "aabcdef"
m = r.search(s)

s[m.start():m.end()]


Out[8]:
'abc'
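
Slicing with the match indices is not the only way to get the matched text. As a small sketch, the group method of the match object (covered under capture groups) returns the same substring when called without arguments:

```python
import re

r = re.compile("abc")
s = "aabcdef"
m = r.search(s)

sliced = s[m.start():m.end()]   # manual slicing with the match indices
whole = m.group()               # group() without arguments: the whole match

print(sliced, whole)            # abc abc
```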

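Pattern objects do not have to be created explicitly: the module-level functions accept the pattern as a plain string. Compiling once still pays off when the same pattern is applied many times.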


In [9]:
re.search("ab", "abcd")


Out[9]:
<_sre.SRE_Match object; span=(0, 2), match='ab'>
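
A typical case for compiling is a pattern that is applied to many strings. The sketch below compiles once outside the loop; the list lines is made-up sample data.

```python
import re

lines = ["abc", "xabcx", "def", "ab c"]   # made-up sample data

pattern = re.compile("abc")               # compile once, outside the loop
hits = [line for line in lines if pattern.search(line)]

print(hits)   # ['abc', 'xabcx']
```

The re module also caches recently used patterns internally, so the gain from explicit compilation is usually modest.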

Basic regular expressions

Below we show the most commonly used regex types. For a full list see the official documentation of the re module.

We will not compile the regular expressions for the sake of brevity.

To match any one character from a set, list the candidates inside []:


In [ ]:
re.search("[bB]", "abc")

In [ ]:
re.search("[bB]", "aBb")

Quantifiers

Quantifiers control how many times the preceding pattern element may repeat.

? matches zero or one time:


In [ ]:
re.search("a?bc", "bc")

In [ ]:
re.search("a?bc", "aabc")

* matches zero or more times in a greedy manner (match as many as possible):


In [ ]:
re.search("a*bc", "aaabc")
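
To make "greedy" concrete, the sketch below compares a* with its non-greedy form a*?, which is not covered above and is shown only as a contrast. The non-greedy form takes as few characters as possible.

```python
import re

text = "aaabc"

greedy = re.search("a*", text).group()    # takes all three leading a's
lazy = re.search("a*?", text).group()     # non-greedy: takes as few as possible

print(repr(greedy), repr(lazy))           # 'aaa' ''
```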

+ matches one or more times:


In [ ]:
re.search("a+bc", "daaaabc")

{N} will match exactly $N$ times:


In [10]:
re.search("a{3}bc", "aabc")

In [11]:
re.search("a{3}bc", "aaaabc")


Out[11]:
<_sre.SRE_Match object; span=(1, 6), match='aaabc'>

{N,M} will match at least $N$ and at most $M$ times:


In [ ]:
re.search("a{3,5}bc", "aaaabc")

In [ ]:
re.search("a{3,5}bc", "aaaaabc")
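
As a side-by-side comparison, the sketch below runs each of the quantified patterns above against the same string. search returns the leftmost match, which is why a?bc and a{3}bc skip some of the leading a's.

```python
import re

patterns = ["a?bc", "a*bc", "a+bc", "a{3}bc", "a{3,5}bc"]
text = "aaaabc"   # four a's followed by bc

# Map each pattern to the substring it matches in text.
results = {p: re.search(p, text).group() for p in patterns}

for p, matched in results.items():
    print(f"{p:10} -> {matched}")
# a?bc       -> abc
# a*bc       -> aaaabc
# a+bc       -> aaaabc
# a{3}bc     -> aaabc
# a{3,5}bc   -> aaaabc
```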

Special characters

. matches any character besides newline:


In [ ]:
re.search("a.c", "abc")

In [ ]:
print(re.search("a.c", "ac"))

^ (caret) matches at the beginning of the string and, in MULTILINE mode, also immediately after each newline:


In [12]:
re.search("^a", "abc"), re.search("^a", "bc\na")


Out[12]:
(<_sre.SRE_Match object; span=(0, 1), match='a'>, None)

In [13]:
re.search("^a", "bc\nab", re.MULTILINE)


Out[13]:
<_sre.SRE_Match object; span=(3, 4), match='a'>

$ matches at the end of the string (or just before a newline that ends the string) and, in MULTILINE mode, also before each newline:


In [ ]:
re.search("c$", "abc")

In [ ]:
re.search("a$", "aba\nc", re.MULTILINE)
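
The effect of MULTILINE on $ is easiest to see by running the same pattern with and without the flag. A short sketch, with a third call showing that $ also matches just before a newline that ends the string:

```python
import re

text = "aba\nc"

no_flag = re.search("a$", text)                   # None: no "a" at the end of the string
with_flag = re.search("a$", text, re.MULTILINE)   # matches the "a" right before "\n"
trailing = re.search("c$", "abc\n")               # "$" matches before a final newline

print(no_flag, with_flag.span(), trailing.span())  # None (2, 3) (2, 3)
```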

Character ranges

[A-F] matches every character between A and F:


In [ ]:
re.search("[A-F]{3}", "abBCDEef")

[A-Za-z] matches every English letter:


In [ ]:
re.search("[A-Za-z]+", "abAaz12")

[0-9] matches digits:


In [ ]:
re.search("[0-9]{3}", "ab12345cd")

A literal - needs to be escaped (or placed first or last inside the brackets) to be included in a character set:


In [ ]:
re.search(r"[0-9\-]+", "1-2")
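
The sketch below compares three ways of writing a literal hyphen in a character set. The patterns are raw strings (r"..."), so Python passes the backslash through to the regex engine instead of treating \- as a string escape.

```python
import re

escaped = re.search(r"[0-9\-]+", "1-2").group()    # escaped hyphen
at_end = re.search(r"[0-9-]+", "1-2").group()      # hyphen placed last: literal
at_start = re.search(r"[-0-9]+", "1-2").group()    # hyphen placed first: literal

print(escaped, at_end, at_start)   # 1-2 1-2 1-2
```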

Character classes

Some character classes are predefined, such as whitespace or word characters.

\s matches any Unicode whitespace:


In [ ]:
re.search(r"\s+", "ab \t\n\n")

\S (capital S) matches any character that is not whitespace:


In [ ]:
re.search(r"\S+", "ab \t\n\n")

\w matches any Unicode word character (letters of any language, digits, and the underscore), and \W matches anything else:


In [ ]:
re.search(r"\w+", "tükőrfúrógép")

In [ ]:
re.search(r"\w+", "tükőr fúrógép")
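
The Unicode behaviour of \w can be switched off with the re.ASCII flag, which restricts \w to [a-zA-Z0-9_]. A small sketch of the difference; the word is written with escape sequences only to keep the example independent of file encoding.

```python
import re

word = "t\u00fck\u0151r"   # the string "tükőr"

unicode_match = re.search(r"\w+", word).group()            # default for str patterns
ascii_match = re.search(r"\w+", word, re.ASCII).group()    # stops at the first non-ASCII letter

print(unicode_match == word, ascii_match)   # True t
```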

Capture groups

Patterns may contain groups, marked by parentheses. Groups can be accessed by their indices, or all of them can be retrieved by the groups method:


In [ ]:
patt = re.compile("([hH]ell[oó]) (.*)!")
match = patt.search("Hello people!")

In [ ]:
match.group()

In [ ]:
match.groups()

In [ ]:
match.group(0)

In [ ]:
match.group(1)

In [ ]:
match.group(2)

Groups can also be named. In this case they can be retrieved as a dict mapping group names to matched substrings using groupdict:


In [ ]:
patt = re.compile("(?P<greeting>[hH]ell[oó]) (?P<name>.*)!")
match = patt.search("Hello people!")
match.group("name")

In [ ]:
match.groupdict()
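
For reference, the sketch below collects the group accessors in one place. Named groups remain accessible by index as well as by name.

```python
import re

patt = re.compile("(?P<greeting>[hH]ell[oó]) (?P<name>.*)!")
match = patt.search("Hello people!")

print(match.group(0))            # Hello people!  (the whole match)
print(match.groups())            # ('Hello', 'people')
print(match.group(1))            # Hello  (by index)
print(match.group("greeting"))   # Hello  (by name)
print(match.groupdict())         # {'greeting': 'Hello', 'name': 'people'}
```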

Other re methods

re.match matches only at the beginning of the string:


In [ ]:
re.match("ab", "abcd")

In [ ]:
print(re.match("ab", "zabcd"))

re.findall finds every non-overlapping occurrence of a pattern in a string. Unlike most other functions, findall returns the matched substrings directly instead of match objects:


In [ ]:
re.findall("[AaBb]+", "ab Abcd")

re.finditer returns an iterator that yields a match object for every match:


In [ ]:
for match in re.finditer("[AaBb]+", "ab Abcd"):
    print(match)
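
The practical difference is that match objects carry positions while plain strings do not. A short sketch on the same input:

```python
import re

text = "ab Abcd"

substrings = re.findall("[AaBb]+", text)                     # list of matched strings
spans = [m.span() for m in re.finditer("[AaBb]+", text)]     # positions from match objects

print(substrings)   # ['ab', 'Ab']
print(spans)        # [(0, 2), (3, 5)]
```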

re.sub replaces every occurrence of a pattern by default:


In [ ]:
re.sub("[Aa]", "b", "acA")
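
By default re.sub replaces all matches. The optional count argument limits the number of replacements, as this short sketch shows:

```python
import re

all_replaced = re.sub("[Aa]", "b", "acA")            # every match is replaced
first_only = re.sub("[Aa]", "b", "acA", count=1)     # only the first match is replaced

print(all_replaced, first_only)   # bcb bcA
```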

re.split splits a string at every pattern match:


In [ ]:
re.split(r"\s+", "words  with\t whitespace")
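
The result is a list of the pieces between the matches. A brief sketch, including the edge case where the string starts with a separator and the first piece is therefore empty:

```python
import re

parts = re.split(r"\s+", "words  with\t whitespace")
leading = re.split(r"\s+", " a b")   # input starts with whitespace

print(parts)     # ['words', 'with', 'whitespace']
print(leading)   # ['', 'a', 'b']
```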