Lesson 28:

Regex .sub() Method and Verbose Mode

re.compile() returns a regex object. Its .search() method finds the first match in a string (call .group() on the returned match object to get the matched text), and its .findall() method returns a list of every match in the string.

These are analogous to a typical find feature.
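As a quick illustration of the two "find" methods, here is a minimal sketch. The names agent_regex, text, and first are made up for this example and are not part of the lesson.

```python
import re

# Hypothetical sample pattern and text, used only for illustration.
agent_regex = re.compile(r'Agent \w+')
text = 'Agent Alice gave the secret documents to Agent Bob.'

# search() returns a match object for the first hit, or None if nothing matches.
first = agent_regex.search(text)
if first is not None:
    print(first.group())          # Agent Alice

# findall() returns the text of every match as a list of strings.
print(agent_regex.findall(text))  # ['Agent Alice', 'Agent Bob']
```

Checking for None before calling .group() avoids an AttributeError when the pattern is not found.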

.sub() is therefore analogous to the replace feature.


In [3]:
import re

namesRegex = re.compile(r'Agent \w+') # Match "Agent " followed by one or more word characters

print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # List matches

print(namesRegex.sub('REDACTED', 'Agent Alice gave the secret documents to Agent Bob.')) # Replace every match


['Agent Alice', 'Agent Bob']
REDACTED gave the secret documents to REDACTED.

You can also keep part of each match by capturing it in a group and referring to that group in the replacement string with a placeholder such as \1.


In [9]:
import re

namesRegex = re.compile(r'Agent (\w)\w*') # Separate the first letter into its own group, then match zero or more further word characters

print(namesRegex.findall('Agent Alice gave the secret documents to Agent Bob.')) # With one group in the pattern, findall() returns only the group 1 text, not the whole match

print(namesRegex.sub(r'Agent \1***', 'Agent Alice gave the secret documents to Agent Bob.')) # Replace each match, keeping the group 1 text in place of \1


['A', 'B']
Agent A*** gave the secret documents to Agent B***.

This is basically a find and replace feature with regex.
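The same placeholder idea extends to more than one group. The sketch below is illustrative only: name_regex, agent_regex, and the sample strings are assumptions, not part of the lesson. It also shows \g<1>, the unambiguous spelling of \1, which is needed when the reference is followed directly by a digit.

```python
import re

# Hypothetical pattern: group 1 is the first word, group 2 is the second word.
name_regex = re.compile(r'(\w+) (\w+)')

# \2 and \1 insert the text of groups 2 and 1, so the two words swap places.
swapped = name_regex.sub(r'\2, \1', 'Alice Smith')
print(swapped)   # Smith, Alice

# r'\1007' would be read as a reference to a group numbered 100 or higher,
# so \g<1> is used to mean "group 1, then the literal text 007".
agent_regex = re.compile(r'Agent (\w)\w*')
numbered = agent_regex.sub(r'Agent \g<1>007', 'Agent Alice met Agent Bob.')
print(numbered)  # Agent A007 met Agent B007.
```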

re.compile() also accepts the re.VERBOSE flag as its second argument. It lets you spread a complicated pattern over multiple lines and add comments, which helps readability.


In [10]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s)   # area code: 415- (no parentheses, with dash) or (415) followed by a space
\d\d\d                   # first 3 digits
-                        # dash
\d\d\d\d                 # last 4 digits
\sx\d{2,4}               # extension, like x1234, with at least 2 and at most 4 digits
'''
, re.VERBOSE) # Verbose mode ignores whitespace and newlines in the pattern (so a space must be written as \s) and treats # as the start of a comment.
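To see a verbose pattern in action, here is a self-contained sketch in the same spirit. The name phone_regex, the sample text, and the choice to make the extension optional are assumptions for this example. Because verbose mode ignores literal whitespace in the pattern, the space after the parenthesized area code is written as \s.

```python
import re

# Illustrative pattern; the optional extension is an assumption of this sketch.
phone_regex = re.compile(r'''
    (\d{3}-|\(\d{3}\)\s)   # area code: "415-" or "(415) "
    \d{3}                  # first 3 digits
    -                      # separator
    \d{4}                  # last 4 digits
    (\sx\d{2,4})?          # optional extension such as " x1234"
''', re.VERBOSE)

# Hypothetical sample text.
text = 'Call 415-555-1234 x99 or (212) 555-0000 today.'

# finditer() yields match objects, so group() returns each whole match
# even though the pattern contains capturing groups.
matches = [m.group() for m in phone_regex.finditer(text)]
print(matches)  # ['415-555-1234 x99', '(212) 555-0000']
```

finditer() is used here instead of findall() because, with two groups in the pattern, findall() would return tuples of group text rather than the full matches.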

The re.compile() function takes only one flags argument after the pattern. If you want re.I to ignore case, re.DOTALL to let . match newlines, and re.VERBOSE to allow a multiline, commented pattern all at once, you have to combine them with the bitwise OR operator, |.


In [ ]:
phoneRegex = re.compile(r'''
(\d\d\d-|\(\d\d\d\)\s)   # area code: 415- (no parentheses, with dash) or (415) followed by a space
\d\d\d                   # first 3 digits
-                        # dash
\d\d\d\d                 # last 4 digits
\sx\d{2,4}               # extension, like x1234, with at least 2 and at most 4 digits
'''
, re.I | re.DOTALL | re.VERBOSE) # Activates the ignore-case, dotall, and verbose flags simultaneously.

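Combining options with bitwise OR is an older idiom that you will rarely see elsewhere in Python. Within the re module it is not limited to re.compile(): other functions that accept a flags argument, such as re.search() and re.sub(), take combined flags the same way.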
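The effect of combining flags can be checked with a small sketch. The sample text and the variable names below are assumptions for illustration only.

```python
import re

# Hypothetical two-line sample text.
text = 'First line\nSECOND line'

# No flags: matching is case-sensitive and . does not cross the newline.
plain = re.compile(r'first.*second')
no_match = plain.search(text)
print(no_match)  # None

# Both flags combined with |: case is ignored and . also matches the newline.
combined = re.compile(r'first.*second', re.IGNORECASE | re.DOTALL)
found = combined.search(text).group()
print(repr(found))  # 'First line\nSECOND'

# Module-level functions accept the same kind of value through their flags parameter.
agents = re.findall(r'agent \w+', 'Agent Alice met AGENT Bob.', flags=re.IGNORECASE)
print(agents)  # ['Agent Alice', 'AGENT Bob']
```

re.I and re.IGNORECASE are two names for the same flag, so either spelling works in the combination.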

Recap

  • The .sub() regex method replaces every match with some other text.
  • Using \1, \2, and so on in the replacement string inserts the text matched by group 1, 2, etc.
  • Passing re.VERBOSE lets you add whitespace and comments to the regex string passed to re.compile() (even in raw strings).
  • If you want to pass multiple flags to re.compile(), such as re.DOTALL, re.IGNORECASE, and re.VERBOSE, combine them with the | bitwise operator.