re.compile() returns a regex object. Its .search() method finds the first match in a string and returns a Match object (call .group() on it to get the matched text), and its .findall() method returns a list of all matching text in a string.
These are analogous to a typical find feature.
The .sub() method is analogous to the replace feature.
In [3]:
import re
namesRegex = re.compile(r'Agent \w+') # Match "Agent" followed by 1 or more word characters
print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # List matches
print(namesRegex.sub('REDACTED', 'Agent Alice gave the secret documents to Agent Bob.')) # Replace every match
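To make the find analogy concrete, here is a minimal sketch contrasting .search() (first match only, returned as a Match object or None) with .findall() (every match, as a list of strings). The variable names and the 'No one here' string are only illustrative.

```python
import re

names_regex = re.compile(r'Agent \w+')
text = 'Agent Alice gave the secret documents to Agent Bob.'

# search() returns a Match object for the first match, or None if nothing matches.
first = names_regex.search(text)
if first is not None:
    print(first.group())  # Agent Alice

# findall() returns the text of every match as a list of strings.
all_names = names_regex.findall(text)
print(all_names)  # ['Agent Alice', 'Agent Bob']

# No match: search() gives None, findall() gives an empty list.
print(names_regex.search('No one here'))   # None
print(names_regex.findall('No one here'))  # []
```

Checking for None before calling .group() avoids an AttributeError when the pattern does not match.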
You can also replace only part of a match by capturing it in a group and referring to that group in the replacement text with a placeholder like \1.
In [9]:
import re
namesRegex = re.compile(r'Agent (\w)\w*') # Separate the first letter into its own group, then match 0 or more word characters
print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # With one group in the pattern, findall() returns only the group 1 text, not the entire match
print(namesRegex.sub(r'Agent \1***', 'Agent Alice gave the secret documents to Agent Bob.')) # Keep group 1 (the first letter) and replace the rest of the name with ***
This is basically a find and replace feature with regex.
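As a small extension of the same idea, the sketch below uses two groups instead of one. It assumes the same sample sentence; the second group and the "007" suffix are made up for illustration. It also shows the \g<1> form, which means the same as \1 but stays unambiguous when the placeholder is followed by a digit.

```python
import re

# Group 1 is the first letter of the name, group 2 is the rest of the name.
agent_regex = re.compile(r'Agent (\w)(\w*)')
text = 'Agent Alice gave the secret documents to Agent Bob.'

# With more than one group, findall() returns a list of tuples, one per match.
pairs = agent_regex.findall(text)
print(pairs)  # [('A', 'lice'), ('B', 'ob')]

# \1 keeps the initial; group 2 is simply not used in the replacement.
masked = agent_regex.sub(r'Agent \1***', text)
print(masked)  # Agent A*** gave the secret documents to Agent B***.

# \g<1> refers to group 1 as well, and is safe to follow with digits.
numbered = agent_regex.sub(r'Agent \g<1>007', text)
print(numbered)  # Agent A007 gave the secret documents to Agent B007.
```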
re.compile() also accepts the re.VERBOSE flag, which allows whitespace and multiline comments inside complicated regex patterns, improving readability.
In [10]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s) # area code: 415- (no parentheses, with dash) or (415) plus a space (parentheses, no dash)
\d\d\d # first 3 digits
- # dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234, with at least 2 and at most 4 digits
''', re.VERBOSE) # Allows multiline regex strings that ignore whitespace and newlines, so every line can carry its own comment.
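A minimal sketch of what re.VERBOSE changes, using a deliberately shorter pattern than the phone regex above (the sample text and variable names are assumptions for illustration). The compact and verbose patterns behave the same. Because whitespace inside a verbose pattern is ignored, a literal space has to be written as \s or as an escaped space.

```python
import re

# The same pattern written compactly and in verbose form.
compact = re.compile(r'(\d{3})-(\d{4})')
verbose = re.compile(r'''
    (\d{3})   # first 3 digits
    -         # dash
    (\d{4})   # last 4 digits
''', re.VERBOSE)

text = 'Call 555-1234 or 555-9876.'
print(compact.findall(text))  # [('555', '1234'), ('555', '9876')]
print(verbose.findall(text) == compact.findall(text))  # True

# In verbose mode a bare space in the pattern is skipped, so match a space with \s.
spaced = re.compile(r'''
    \(\d{3}\)  # area code in parentheses
    \s         # the space after the closing parenthesis
    \d{3}      # first 3 digits
''', re.VERBOSE)
print(spaced.search('(415) 555').group())  # (415) 555
```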
The re.compile() function takes only one flags argument, so if you want to use re.I to ignore case, re.DOTALL to let . match newlines, and re.VERBOSE to write multiline regex together, you have to combine them with the bitwise OR operator, |.
In [ ]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s) # area code: 415- (no parentheses, with dash) or (415) plus a space (parentheses, no dash)
\d\d\d # first 3 digits
- # dash
\d\d\d\d # last 4 digits
\sx\d{2,4} # extension, like x1234, with at least 2 and at most 4 digits
''', re.I | re.DOTALL | re.VERBOSE) # Activates the ignore-case, dotall, and verbose flags simultaneously.
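Combining options with bitwise OR is a convention inherited from older code and is unusual elsewhere in Python. In the re module it applies wherever a flags argument is accepted, such as re.compile().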
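To see what each flag contributes, here is a small sketch that compiles the same pattern with no flags, with re.I alone, and with re.I | re.DOTALL. The sample text and pattern are made up for illustration.

```python
import re

text = 'Agent ALICE\nreported in.'

# No flags: the match is case-sensitive, so lowercase 'agent' finds nothing.
no_flags = re.compile(r'agent.*')
print(no_flags.search(text))  # None

# re.I alone: case is ignored, but . still stops at the newline.
ignore_case = re.compile(r'agent.*', re.I)
print(ignore_case.search(text).group())  # Agent ALICE

# re.I | re.DOTALL: case is ignored and . also matches the newline.
both = re.compile(r'agent.*', re.I | re.DOTALL)
print(repr(both.search(text).group()))  # 'Agent ALICE\nreported in.'
```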
The .sub() regex method substitutes matches with some other text.
In the replacement text passed to .sub(), \1, \2, and so on are replaced by the text of group 1, 2, etc.
re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile() (even in raw strings).
To pass multiple flags to re.compile() (like re.DOTALL, re.IGNORECASE, and re.VERBOSE), combine them with the | bitwise operator.