re.compile() returns a regex object. Its .search() method finds the first match in a string and returns a match object, whose .group() method returns the matched text. Its .findall() method returns a list of every match in a string.
Both are analogous to a typical find feature; .sub() is analogous to the replace feature.
In [3]:
import re
namesRegex = re.compile(r'Agent \w+') # Match "Agent " followed by one or more word characters
print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # List matches
print(namesRegex.sub('REDACTED', 'Agent Alice gave the secret documents to Agent Bob.')) # Replace every match
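The cell above only exercises .findall() and .sub(). As a minimal sketch of the first-match case, reusing the same pattern: .search() returns a match object (or None when nothing matches), and .group() on that match object returns the matched text. The names sentence and firstMatch are illustrative only.

```python
import re

namesRegex = re.compile(r'Agent \w+')
sentence = 'Agent Alice gave the secret documents to Agent Bob.'

# search() returns a match object for the first match, or None if there is none.
firstMatch = namesRegex.search(sentence)
if firstMatch is not None:
    print(firstMatch.group())  # Agent Alice

# findall() returns every non-overlapping match as a list of strings.
print(namesRegex.findall(sentence))  # ['Agent Alice', 'Agent Bob']
```

Checking for None before calling .group() avoids an AttributeError when the pattern does not occur in the string.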
You can also keep part of a match in the replacement by capturing it in a group and referring to it with a placeholder like \1.
In [9]:
import re
namesRegex = re.compile(r'Agent (\w)\w*') # Separate the first letter into its own group, then match zero or more word characters
print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # Returns only the group 1 text of each match, not the entire match
print(namesRegex.sub(r'Agent \1***', 'Agent Alice gave the secret documents to Agent Bob.')) # Replace each match, keeping its group 1 text
This is basically a find and replace feature with regex.
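The same idea extends to more than one group. The sketch below is an illustrative variation on the pattern above (the second group and the swapped replacement format are assumptions, not part of the original example): with two groups, .findall() returns a list of tuples, and \1 and \2 in the replacement string refer to the first and second group.

```python
import re

# Group 1 captures the first letter, group 2 the rest of the name.
namesRegex = re.compile(r'Agent (\w)(\w*)')
sentence = 'Agent Alice gave the secret documents to Agent Bob.'

# With two groups, findall() returns one tuple per match.
print(namesRegex.findall(sentence))  # [('A', 'lice'), ('B', 'ob')]

# \1 and \2 in the replacement string stand for groups 1 and 2.
swapped = namesRegex.sub(r'Agent \2-\1', sentence)
print(swapped)  # Agent lice-A gave the secret documents to Agent ob-B.
```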
re.compile() also accepts the re.VERBOSE flag as a second argument. It allows whitespace and per-line comments inside complicated regex patterns, which helps readability.
In [10]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s) # area code: digits with a trailing dash, or parenthesized digits with a trailing space
\d\d\d # first 3 digits
- # dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234, with at least 2 and at most 4 digits
''', re.VERBOSE) # Allows multiline regex strings that ignore whitespace, so every line can carry its own comment.
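As a rough check of what re.VERBOSE changes, the sketch below compiles a simplified phone pattern twice, once compact and once verbose, and confirms that both find the same numbers. The pattern omits the extension, and the sample text is made up for illustration. Because verbose mode ignores literal whitespace in the pattern, the space after a parenthesized area code is written as \s.

```python
import re

# Compact form: hard to read at a glance.
compact = re.compile(r'(\d\d\d-|\(\d\d\d\)\s)\d\d\d-\d\d\d\d')

# Verbose form: same pattern, one piece per line with comments.
verbose = re.compile(r'''
    (\d\d\d-|\(\d\d\d\)\s)  # area code: "415-" or "(415) "
    \d\d\d                  # first 3 digits
    -                       # dash
    \d\d\d\d                # last 4 digits
''', re.VERBOSE)

text = 'Call 415-555-1234 or (415) 555-9999.'

# finditer() yields match objects, so group() gives each whole match.
compactHits = [m.group() for m in compact.finditer(text)]
verboseHits = [m.group() for m in verbose.finditer(text)]
print(verboseHits)  # ['415-555-1234', '(415) 555-9999']
print(compactHits == verboseHits)  # True
```

.finditer() is used here instead of .findall() because the pattern contains a group, and .findall() would return only the group text.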
re.compile() takes only one flags argument, so if you want to use re.I to ignore case, re.DOTALL to let . match newlines, and re.VERBOSE to write a multiline regex all at once, you have to combine them with the bitwise OR operator, |.
In [ ]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s) # area code: digits with a trailing dash, or parenthesized digits with a trailing space
\d\d\d # first 3 digits
- # dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234, with at least 2 and at most 4 digits
''', re.I | re.DOTALL | re.VERBOSE) # Activates the ignore-case, dotall, and verbose flags simultaneously.
This flag-combining syntax is a holdover from older code. It is rare elsewhere in Python, but it works wherever the re module accepts a flags argument, including re.compile().
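To make the effect of combined flags visible, here is a small sketch that compiles the same pattern with and without re.I | re.DOTALL. The pattern and sample text are invented for illustration.

```python
import re

text = 'First line\nSECOND line'

# No flags: the match is case-sensitive and "." stops at the newline.
plain = re.compile(r'first.*second')
print(plain.search(text))  # None

# re.I ignores case and re.DOTALL lets "." match newlines; | combines them.
flagged = re.compile(r'first.*second', re.I | re.DOTALL)
print(repr(flagged.search(text).group()))  # 'First line\nSECOND'
```

Each flag is an integer-like constant with its own bit set, which is why bitwise OR merges them into a single value for the one flags argument.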
The .sub() regex method substitutes matches with some other text.
\1, \2, and so on insert the text of group 1, 2, and so on into the replacement string.
re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile() (even in raw strings).
To pass multiple flags to re.compile() (like re.DOTALL, re.IGNORECASE, and re.VERBOSE), combine them with the | bitwise operator.