Regular expressions (regex or regexp) are search patterns described as strings. They allow pattern matching in arbitrary strings.
Regular expressions are implemented in Python's re module.
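The re module operates via two kinds of objects: pattern objects, which hold a compiled regular expression, and match objects, which describe a successful match.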
In [1]:
import re
r = re.compile("abc")
type(r), r
Out[1]:
In [4]:
m = r.search("aabcd")
type(m), m
Out[4]:
None is returned instead when the pattern is not found in the string:
In [5]:
m2 = r.search("def")
m2
The match object's span method returns the starting and ending index of the match:
In [6]:
m = r.search("aabcde")
m.span()
Out[6]:
In [7]:
m.start(), m.end()
Out[7]:
These indices can be used directly to extract the matching substring:
In [8]:
s = "aabcdef"
m = r.search(s)
s[m.start():m.end()]
Out[8]:
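As a small sketch of the same idea, the match object's group method returns the matched substring directly, so the slice is not strictly needed. The variable names sliced and matched are illustrative only.

```python
import re

r = re.compile("abc")
s = "aabcdef"
m = r.search(s)

# span() is the (start, end) pair; slicing with it yields the match
start, end = m.span()
sliced = s[start:end]

# group() with no argument returns the whole matched substring
matched = m.group()

print((start, end), sliced, matched)
```

Both approaches give the same substring; slicing is mostly useful when the text around the match is needed as well.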
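Pattern objects do not have to be created explicitly: module-level functions such as re.search accept the pattern as a string. Compiling still offers some speed-up when the same pattern is used many times.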
In [9]:
re.search("ab", "abcd")
Out[9]:
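The two styles can be compared side by side. This is a minimal sketch with a made-up list of words; it only shows that both calls find the same matches, not how large the speed difference is.

```python
import re

words = ["abcd", "cab", "xyz"]  # hypothetical sample data

# Compiled once, then reused for every string
pattern = re.compile("ab")
compiled_hits = [w for w in words if pattern.search(w)]

# Module-level call: the pattern is passed as a string each time
direct_hits = [w for w in words if re.search("ab", w)]

print(compiled_hits, direct_hits)
```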
Below we show the most commonly used regex constructs. For a full list, see the official documentation of the re module.
We will not compile the regular expressions for the sake of brevity.
To match any one character from a set of characters, use []:
In [ ]:
re.search("[bB]", "abc")
In [ ]:
re.search("[bB]", "aBb")
? matches zero or one time:
In [ ]:
re.search("a?bc", "bc")
In [ ]:
re.search("a?bc", "aabc")
* matches zero or more times in a greedy manner (it matches as many repetitions as possible):
In [ ]:
re.search("a*bc", "aaabc")
+ matches one or more times:
In [ ]:
re.search("a+bc", "daaaabc")
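The difference between * and + is easiest to see on an input with no leading "a" at all. A short sketch, with illustrative variable names:

```python
import re

# a* takes every consecutive "a" it can before "bc"
greedy = re.search("a*bc", "aaabc").group()

# zero repetitions are allowed, so "bc" alone still matches
zero = re.search("a*bc", "bc").group()

# a+ requires at least one "a", so the same input gives no match
plus_result = re.search("a+bc", "bc")

print(greedy, zero, plus_result)
```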
{N} will match exactly $N$ times:
In [10]:
re.search("a{3}bc", "aabc")
In [11]:
re.search("a{3}bc", "aaaabc")
Out[11]:
{N,M} will match at least $N$ and at most $M$ times:
In [ ]:
re.search("a{3,5}bc", "aaaabc")
In [ ]:
re.search("a{3,5}bc", "aaaaabc")
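Note that re.search looks for a matching substring, so surplus repetitions before the match are simply skipped rather than causing a failure. The sketch below illustrates this with inputs chosen for the purpose:

```python
import re

# Too few "a"s: no substring has three of them before "bc"
too_few = re.search("a{3}bc", "aabc")

# Four "a"s: the match starts at the second one
exact = re.search("a{3}bc", "aaaabc")

# Seven "a"s: at most five are taken, so the match starts at the third one
ranged = re.search("a{3,5}bc", "aaaaaaabc")

print(too_few, exact.group(), exact.span(), ranged.group(), ranged.span())
```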
. matches any single character except a newline:
In [ ]:
re.search("a.c", "abc")
In [ ]:
print(re.search("a.c", "ac"))
^ (caret) matches at the beginning of the string and, in MULTILINE mode, also immediately after each newline:
In [12]:
re.search("^a", "abc"), re.search("^a", "bc\na")
Out[12]:
In [13]:
re.search("^a", "bc\nab", re.MULTILINE)
Out[13]:
$ matches at the end of the string (or just before a trailing newline) and, in MULTILINE mode, also before each newline:
In [ ]:
re.search("c$", "abc")
In [ ]:
re.search("a$", "aba\nc", re.MULTILINE)
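The effect of the MULTILINE flag can be seen by collecting all matches with re.findall on a small three-line string. The text below is made up for illustration:

```python
import re

text = "abc\nabd\nxbc"  # three lines: "abc", "abd", "xbc"

# Without the flag, ^ and $ only anchor at the ends of the whole string
first_only = re.findall("^a", text)
last_only = re.findall("c$", text)

# With the flag, they anchor at the start and end of every line
line_starts = re.findall("^a", text, re.MULTILINE)
line_ends = re.findall("c$", text, re.MULTILINE)

print(first_only, last_only, line_starts, line_ends)
```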
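Inside [], a range of characters can be given with -. For example, [A-F] matches any uppercase letter from A to F:
In [ ]: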
re.search("[A-F]{3}", "abBCDEef")
[A-Za-z] matches any letter of the English alphabet:
In [ ]:
re.search("[A-Za-z]+", "abAaz12")
[0-9] matches digits:
In [ ]:
re.search("[0-9]{3}", "ab12345cd")
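- needs to be escaped (or placed first or last inside the brackets) if we want to include it in the character set literally. A raw string keeps the backslash from being interpreted by Python:
In [ ]:
re.search(r"[0-9\-]+", "1-2")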
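The following sketch compares the escaped hyphen with a hyphen placed last in the set, and with a set that contains no literal hyphen at all. The sample string is invented for the example.

```python
import re

text = "tel: 555-0199"  # hypothetical input

# Escaped hyphen: digits and "-" are both part of the set
escaped = re.search(r"[0-9\-]+", text).group()

# A hyphen placed last in the set is also taken literally
trailing = re.search(r"[0-9-]+", text).group()

# Here "-" only denotes the range 0..9, so the match stops at the hyphen
digits_only = re.search(r"[0-9]+", text).group()

print(escaped, trailing, digits_only)
```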
\s matches any whitespace character (space, tab, newline and so on):
In [ ]:
re.search(r"\s+", "ab \t\n\n")
\S (capital S) matches anything else:
In [ ]:
re.search(r"\S+", "ab \t\n\n")
\w matches any Unicode word character (letters of any language, digits and the underscore), and \W matches anything else:
In [ ]:
re.search(r"\w+", "tükőrfúrógép")
In [ ]:
re.search(r"\w+", "tükőr fúrógép")
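As a rough illustration of how \w and \W partition a string, the sketch below collects both kinds of runs with re.findall. The sample text is an assumption made for the example.

```python
import re

text = "tükőr fúrógép_42!"  # hypothetical input with accents, digits and "_"

# Runs of word characters: accented letters, digits and "_" all count
words = re.findall(r"\w+", text)

# Runs of everything else: here the space and the exclamation mark
non_words = re.findall(r"\W+", text)

print(words, non_words)
```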
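Parts of a pattern can be captured as groups by enclosing them in parentheses. The match object's group method returns the whole match (group 0) or a numbered group, and groups returns all captured groups as a tuple:
In [ ]: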
patt = re.compile("([hH]ell[oó]) (.*)!")
match = patt.search("Hello people!")
In [ ]:
match.group()
In [ ]:
match.groups()
In [ ]:
match.group(0)
In [ ]:
match.group(1)
In [ ]:
match.group(2)
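Numbered groups are handy for pulling several fields out of one match. The following is a minimal sketch; the date format and the variable names are assumptions chosen for illustration.

```python
import re

# Three groups: year, month, day (numbered 1 to 3 from left to right)
date_patt = re.compile("([0-9]{4})-([0-9]{2})-([0-9]{2})")
text = "released on 2021-03-15."
m = date_patt.search(text)

whole = m.group(0)             # the entire match
year, month, day = m.groups()  # all captured groups as a tuple
pair = m.group(1, 3)           # several groups at once, as a tuple

# start() and end() also accept a group number
month_slice = text[m.start(2):m.end(2)]

print(whole, year, month, day, pair, month_slice)
```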
Groups can also be named. In this case they can be retrieved as a dict mapping group names to matched substrings using groupdict:
In [ ]:
patt = re.compile("(?P<greeting>[hH]ell[oó]) (?P<name>.*)!")
match = patt.search("Hello people!")
match.group("name")
In [ ]:
match.groupdict()
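Named groups remain accessible by number as well, and a match object can be indexed by group name. The key=value format below is a made-up example, not something from the text above.

```python
import re

patt = re.compile("(?P<key>[A-Za-z]+)=(?P<value>[0-9]+)")
m = patt.search("timeout=30")

as_dict = m.groupdict()                # {'key': 'timeout', 'value': '30'}
same_group = m.group("key") == m.group(1)
by_index = m["value"]                  # shorthand for m.group("value")

# Build a settings dict from every key=value pair in a string
settings = {
    mo["key"]: int(mo["value"])
    for mo in patt.finditer("timeout=30 retries=5")
}

print(as_dict, same_group, by_index, settings)
```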
re.match only matches at the beginning of the string:
In [ ]:
re.match("ab", "abcd")
In [ ]:
print(re.match("ab", "zabcd"))
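For comparison, here is a short sketch of how re.match and re.search behave on the same inputs. The variable names are illustrative.

```python
import re

at_start = re.match("ab", "abcd")          # pattern is at position 0: match
not_at_start = re.match("ab", "zabcd")     # pattern starts at position 1: None
found_anywhere = re.search("ab", "zabcd")  # search scans the whole string
anchored = re.search("^ab", "zabcd")       # ^ makes search behave like match here

print(at_start, not_at_start, found_anywhere, anchored)
```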
re.findall finds every non-overlapping occurrence of a pattern in a string. Unlike most other functions, findall directly returns the matched substrings instead of match objects.
In [ ]:
re.findall("[AaBb]+", "ab Abcd")
re.finditer returns an iterator that iterates over every match:
In [ ]:
for match in re.finditer("[AaBb]+", "ab Abcd"):
    print(match)
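Because finditer yields match objects, positions are available in addition to the matched text. A small sketch contrasting the two functions:

```python
import re

text = "ab Abcd"

# findall: only the matched substrings
found = re.findall("[AaBb]+", text)

# finditer: full match objects, so spans can be collected too
positions = [(m.group(), m.span()) for m in re.finditer("[AaBb]+", text)]

print(found, positions)
```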
re.sub replaces every occurrence of the pattern:
In [ ]:
re.sub("[Aa]", "b", "acA")
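The number of replacements can be limited with the count argument of re.sub. A minimal sketch:

```python
import re

# Default: every occurrence is replaced
all_replaced = re.sub("[Aa]", "b", "acA")

# count=1: only the first occurrence is replaced
first_only = re.sub("[Aa]", "b", "acA", count=1)

print(all_replaced, first_only)
```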
re.split splits a string at every match of the pattern:
In [ ]:
re.split(r"\s+", "words with\t whitespace")
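Two details of re.split are worth a quick sketch: separators at the edges of the string produce empty strings in the result, and maxsplit limits the number of splits. The inputs are invented for the example.

```python
import re

# A run of mixed whitespace counts as a single separator
parts = re.split(r"\s+", "words with\t whitespace")

# Leading and trailing separators yield empty strings at the ends
padded = re.split(r"\s+", " padded ")

# maxsplit=1 splits only at the first separator
limited = re.split(r"\s+", "a b c", maxsplit=1)

print(parts, padded, limited)
```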